Improving btree performance through specializing by key shape, take 2
Hi,
Here's a new patch, well-timed for the next feature freeze. [0]
Last year I submitted a patchset that updated the way nbtree keys are
compared [1] by moving from index_getattr to an iterator-based
approach. That patch did perform quite well for multi-column indexes,
but significantly reduced performance for keys that benefit from
Form_pg_attribute->attcacheoff.
Here's generation 2 of this effort. Instead of trying to shoehorn all
types of btree keys into one common code path, this patchset
acknowledges that there are different shapes of keys, each with its own
best way of accessing subsequent key attributes. It achieves this by
specializing the hot functions to those key shapes.
The approach is comparable to JIT, but differs in that it optimizes a
fixed set of statically defined shapes rather than compiling for
arbitrary shapes at runtime. JIT could be implemented, but not easily
with the APIs available to index AMs: the amhandler for an index does
not receive any information about what the shape of the index is going
to be, so it cannot hand back functions specialized for that shape.
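
To make the idea concrete before walking through the patches: the
sketch below is not the patch code, just a minimal, self-contained
illustration of the same pattern in plain C -- one function body
stamped out per key shape by the preprocessor, plus a cheap dispatch at
the call site. All names here (key_shape, count_nonzero, FIRST_ATT,
NTH_ATT) are invented for the example.

#include <stdio.h>

/* Invented stand-in for "key shape"; the patches derive this from the index. */
typedef enum { SHAPE_SINGLE, SHAPE_MULTI } key_shape;

/*
 * "Template": one function body, stamped out once per shape.  The only
 * shape-specific part -- how the next key attribute is fetched -- is passed
 * in as the NEXT_ATT macro, much like the nbts_attiter_* macros in the
 * patchset.
 */
#define DEFINE_COUNT_NONZERO(suffix, NEXT_ATT)                      \
    static int count_nonzero_##suffix(const int *atts, int natts)   \
    {                                                               \
        int n = 0;                                                  \
        for (int i = 0; i < natts; i++)                             \
            if (NEXT_ATT(atts, i) != 0)                             \
                n++;                                                \
        return n;                                                   \
    }

/* Per-shape accessors (stand-ins for "direct fetch" vs. "iterate"). */
#define FIRST_ATT(atts, i)  ((atts)[0])    /* single column: always slot 0 */
#define NTH_ATT(atts, i)    ((atts)[(i)])  /* multi-column: walk the array  */

DEFINE_COUNT_NONZERO(single, FIRST_ATT)    /* defines count_nonzero_single() */
DEFINE_COUNT_NONZERO(multi,  NTH_ATT)      /* defines count_nonzero_multi()  */

/* Call-site dispatch, analogous in spirit to NBT_SPECIALIZE_CALL. */
static int
count_nonzero(key_shape shape, const int *atts, int natts)
{
    return shape == SHAPE_SINGLE ? count_nonzero_single(atts, natts)
                                 : count_nonzero_multi(atts, natts);
}

int
main(void)
{
    int atts[] = {3, 0, 7};

    printf("%d\n", count_nonzero(SHAPE_MULTI, atts, 3));   /* prints 2 */
    printf("%d\n", count_nonzero(SHAPE_SINGLE, atts, 1));  /* prints 1 */
    return 0;
}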
0001: Moves code that can benefit from key attribute accessor
specialization out of its current files and into specializable files.
The functions selected for specialization are either little more than
wrappers around specializable functions, or functions that have (hot)
loops around specializable code, where "specializable" means accessing
multiple IndexTuple attributes directly.
0002: Updates the specializable code to use the specialized attribute
iteration macros.
0003: Optimizes access to the key column when there's only one key column.
0004: Optimizes access to the key columns when we cannot use
attcacheoff for them (a toy sketch of the iterator idea follows this
list of patches).
0005: Creates a helper function to populate all attcacheoff fields
with their correct values, filling them with -2 whenever a cache
offset is impossible to determine (as opposed to the default -1,
"unknown"). This allows functions to determine the cacheability of the
n-th attribute in O(1); a standalone sketch of the idea also follows
below.
0006: Creates a specialization macro that replaces rd_indam->aminsert
with its optimized variant, for improved index tuple insertion
performance (also sketched below).
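
To illustrate what 0004 is getting at (this is not the patch code):
without a usable attcacheoff, index_getattr has to re-walk every
earlier attribute on each call, so decoding a whole key is O(n^2) in
the number of attributes. An iterator that carries the running offset
between calls turns that into a single O(n) pass. A toy, standalone
sketch of that idea, using a made-up format with a 1-byte length prefix
per attribute:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Toy "tuple": attributes stored back to back, each with a 1-byte length. */
typedef struct ToyIter
{
    size_t      off;            /* running offset into the tuple */
} ToyIter;

static void
toy_iter_init(ToyIter *it)
{
    it->off = 0;
}

/*
 * Return a pointer to the next attribute and advance the iterator.  Keeping
 * the offset in the iterator is what avoids re-decoding all earlier
 * attributes on every access -- the same role IAttrIterState plays in 0004.
 */
static const uint8_t *
toy_iter_next(const uint8_t *tup, ToyIter *it, size_t *len)
{
    const uint8_t *data;

    *len = tup[it->off];        /* 1-byte length prefix */
    data = tup + it->off + 1;
    it->off += 1 + *len;
    return data;
}

int
main(void)
{
    /* three attributes: "ab", "c", "def" */
    const uint8_t tup[] = {2, 'a', 'b', 1, 'c', 3, 'd', 'e', 'f'};
    ToyIter     it;
    size_t      len;

    toy_iter_init(&it);
    for (int i = 0; i < 3; i++)
    {
        const uint8_t *att = toy_iter_next(tup, &it, &len);

        printf("att %d: %.*s\n", i, (int) len, (const char *) att);
    }
    return 0;
}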
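
Similarly, the point of 0005's -2 marker, as a rough standalone sketch
rather than the patch itself (alignment is ignored here and all names
are invented): after one O(n) pass, "does attribute k have a fixed,
cacheable offset?" becomes an O(1) test on the stored value.

#include <stdio.h>

#define OFF_UNKNOWN   (-1)      /* not yet computed (the current default) */
#define OFF_UNCACHED  (-2)      /* provably has no fixed offset (0005's marker) */

/*
 * One-time pass: assign fixed offsets while every preceding attribute is
 * fixed-length, then mark everything after the first variable-length
 * attribute as uncacheable.  Returns the number of cacheable attributes.
 */
static int
populate_cache_offsets(const int *attlen, int *cacheoff, int natts)
{
    int         off = 0;
    int         i = 0;

    for (; i < natts; i++)
    {
        cacheoff[i] = off;
        if (attlen[i] < 0)      /* variable length, e.g. varlena */
        {
            i++;
            break;
        }
        off += attlen[i];
    }
    for (int j = i; j < natts; j++)
        cacheoff[j] = OFF_UNCACHED;
    return i;
}

int
main(void)
{
    int         attlen[] = {4, 8, -1, 4};   /* e.g. int4, int8, text, int4 */
    int         cacheoff[] = {OFF_UNKNOWN, OFF_UNKNOWN, OFF_UNKNOWN, OFF_UNKNOWN};

    populate_cache_offsets(attlen, cacheoff, 4);

    /* After the one-time pass, cacheability is a simple sign check. */
    for (int k = 0; k < 4; k++)
        printf("att %d: offset %d, cacheable: %s\n",
               k, cacheoff[k], cacheoff[k] >= 0 ? "yes" : "no");
    return 0;
}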
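
Finally, the general shape of the 0006 trick; the diff for it isn't
quoted in this mail, so treat this strictly as a sketch of the pattern
with invented names, not as what the patch literally does: once the key
shape of an index is known, the generic entry point in the AM routine
struct is replaced by a pointer to the specialized variant, so later
calls skip any per-call dispatch.

#include <stdbool.h>
#include <stdio.h>

/* Invented stand-in for the AM routine struct holding the insert callback. */
typedef struct ToyAmRoutine
{
    bool        (*aminsert) (int key);
} ToyAmRoutine;

static bool
insert_generic(int key)
{
    printf("generic insert of %d\n", key);
    return true;
}

static bool
insert_singlecol(int key)
{
    printf("single-column insert of %d\n", key);
    return true;
}

/* Swap in the specialized variant once the index's key shape is known. */
static void
specialize_insert(ToyAmRoutine *am, int nkeyatts)
{
    am->aminsert = (nkeyatts == 1) ? insert_singlecol : insert_generic;
}

int
main(void)
{
    ToyAmRoutine am = {insert_generic};

    specialize_insert(&am, 1);
    am.aminsert(42);            /* now calls the single-column variant */
    return 0;
}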
These patches still have some rough edges (specifically: some
functions that are being generated are unused, and intermediate
patches don't compile), but I wanted to get this out to get some
feedback on this approach.
I expect the performance to be at least on par with current btree
code, and I'll try to publish a more polished patchset with
performance results sometime in the near future. I'll also try to
re-attach dynamic page-level prefix truncation, but that depends on
the amount of time I have and the amount of feedback on this patchset.
-Matthias
[0]: The one for PG16, that is.
[1]: https://www.postgresql.org/message-id/CAEze2Whwvr8aYcBf0BeBuPy8mJGtwxGvQYA9OGR5eLFh6Q_ZvA@mail.gmail.com
Attachments:
v1-0004-Implement-specialized-uncacheable-attribute-itera.patch (application/octet-stream)
From 18014f792c0c5d289c93b03aff456a196fa319ff Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:44:01 +0200
Subject: [PATCH v1 4/6] Implement specialized uncacheable attribute iteration
Uses an iterator to prevent doing duplicate work while iterating over
attributes.
Inspiration: https://www.postgresql.org/message-id/CAEze2WjE9ka8i%3Ds-Vv5oShro9xTrt5VQnQvFG9AaRwWpMm3-fg%40mail.gmail.com
---
src/include/access/itup.h | 235 +++++++++++++++++++++++++
src/include/access/nbtree.h | 12 +-
src/include/access/nbtree_specialize.h | 33 ++++
3 files changed, 278 insertions(+), 2 deletions(-)
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 2c8877e991..3b429ff115 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -59,6 +59,15 @@ typedef struct IndexAttributeBitMapData
typedef IndexAttributeBitMapData * IndexAttributeBitMap;
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
/*
* t_info manipulation macros
*/
@@ -126,6 +135,84 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
) \
)
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupdesc - the tuple description
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* offset */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attlen >= 0, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false)
+
+/* ----------------
+ * index_attiternext
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiternext() for the rest.
+ *
+ * ----------------
+ */
+#define index_attiternext(itup, attnum, tupleDesc, iter) \
+( \
+ AssertMacro(PointerIsValid(iter) && (attnum) > 0), \
+ (!IndexTupleHasNulls(itup)) ? \
+ ( \
+ !(iter)->slow && TupleDescAttr((tupleDesc), (attnum) - 1)->attcacheoff >= 0 ? \
+ ( \
+ (iter)->offset = att_addlength_pointer(TupleDescAttr((tupleDesc), \
+ (attnum) - 1)->attcacheoff, TupleDescAttr((tupleDesc), \
+ (attnum) - 1)->attlen, (char *) (itup) + \
+ IndexInfoFindDataOffset((itup)->t_info) + \
+ TupleDescAttr((tupleDesc), (attnum) - 1)->attcacheoff), \
+ (iter)->isNull = false,\
+ (iter)->slow = TupleDescAttr((tupleDesc), (attnum) - 1)->attlen < 0, \
+ (Datum) fetchatt(TupleDescAttr((tupleDesc), (attnum) - 1), \
+ (char *) (itup) + IndexInfoFindDataOffset((itup)->t_info) + \
+ TupleDescAttr((tupleDesc), (attnum) - 1)->attcacheoff) \
+ ) \
+ : \
+ nocache_index_attiternext((itup), (attnum), (tupleDesc), (iter)) \
+ ) \
+ : \
+ ( \
+ att_isnull((attnum) - 1, (char *) (itup) + sizeof(IndexTupleData)) ? \
+ ( \
+ (iter)->isNull = true, \
+ (iter)->slow = true, \
+ (Datum) 0 \
+ ) \
+ : \
+ nocache_index_attiternext((itup), (attnum), (tupleDesc), (iter)) \
+ ) \
+)
+
/*
* MaxIndexTuplesPerPage is an upper bound on the number of tuples that can
* fit on one index page. An index tuple must have either data or a null
@@ -161,4 +248,152 @@ extern IndexTuple CopyIndexTuple(IndexTuple source);
extern IndexTuple index_truncate_tuple(TupleDesc sourceDescriptor,
IndexTuple source, int leavenatts);
+/*
+ * Initiate an index attribute iterator to attribute attnum,
+ * and return the corresponding datum.
+ *
+ * This is nearly the same as index_deform_tuple, except that this
+ * returns the internal state up to attnum, instead of populating the
+ * datum- and isnull-arrays
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ /*
+ * We can only cache the offset for a varlena attribute if the
+ * offset is already suitably aligned, so that there would be no
+ * pad bytes in any case: then the offset will be valid for either
+ * an aligned or unaligned value.
+ */
+ if (!slow &&
+ off == att_align_nominal(off, thisatt->attalign))
+ thisatt->attcacheoff = off;
+ else
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+
+ if (!slow)
+ thisatt->attcacheoff = off;
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+static inline Datum
+nocache_index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true; /* can't use attcacheoff anymore */
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ /*
+ * We can only cache the offset for a varlena attribute if the
+ * offset is already suitably aligned, so that there would be no
+ * pad bytes in any case: then the offset will be valid for either
+ * an aligned or unaligned value.
+ */
+ if (!iter->slow &&
+ iter->offset == att_align_nominal(iter->offset, thisatt->attalign))
+ thisatt->attcacheoff = iter->offset;
+ else
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+
+ if (!iter->slow)
+ thisatt->attcacheoff = iter->offset;
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
#endif /* ITUP_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a758748635..d59531f3b3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1122,6 +1122,7 @@ typedef struct BTOptions
*/
#define NBTS_TYPE_SINGLE_COLUMN single
#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_DEFAULT default
@@ -1135,12 +1136,19 @@ typedef struct BTOptions
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
- (rel)->rd_index->indnkeyatts == 1 ? ( \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
) \
: \
( \
- NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ TupleDescAttr(RelationGetDescr(rel), \
+ IndexRelationGetNumberOfKeyAttributes(rel) - 1)->attcacheoff > 0 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_UNCACHED)(__VA_ARGS__) \
+ ) \
) \
)
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index 2e17a761a0..37d373480d 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -166,6 +166,39 @@
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but the attcacheoff optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter);
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit(itup, initAttNum, tupDesc, &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext(itup, spec_i, tupDesc, &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/* reset call to SPECIALIZE_CALL for default behaviour */
#undef nbts_call_norel
#define nbts_call_norel(name, rel, ...) \
--
2.30.2
v1-0005-Add-a-function-whose-task-it-is-to-populate-all-a.patch (application/octet-stream)
From 395d1e1f93dabac54040ebac00f1d3eadb4367de Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:51:01 +0200
Subject: [PATCH v1 5/6] Add a function whose task it is to populate all
attcacheoff-s of a TupleDesc's attributes
It fills uncacheable offsets with -2 (as opposed to -1, which signals
"unknown"), allowing users of the API to determine the cacheability of
an attribute in O(1) after this one-time O(n) cost, rather than paying
the repeated O(n) cost that currently applies.
---
src/backend/access/common/tupdesc.c | 97 +++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 99 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 9f41b1e854..5630fc9da0 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -910,3 +910,100 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc, returning the last
+ * attcacheoff with a valid value.
+ *
+ * Sets attcacheoff to -2 for uncacheable attributes (i.e. attributes after a
+ * variable-length attribute).
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber i, j;
+
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ {
+ /*
+ * Already done the calculations, find the last attribute that has
+ * cache offset.
+ */
+ for (i = (AttrNumber) numberOfAttributes; i > 1; i--)
+ {
+ if (TupleDescAttr(desc, i - 1)->attcacheoff != -2)
+ return i;
+ }
+
+ return 1;
+ }
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ i = 1;
+ /*
+ * Someone might have set some offsets previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (i < numberOfAttributes && TupleDescAttr(desc, i)->attcacheoff > 0)
+ i++;
+
+ /* Cache offset is undetermined. Start calculating offsets if possible */
+ if (i < numberOfAttributes &&
+ TupleDescAttr(desc, i)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, i - 1);
+ Size off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (i < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, i);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ i++;
+ }
+ } else {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ }
+ }
+
+ /*
+ * No cacheable offsets left. Fill the rest with -2s, but return the latest
+ * cached offset.
+ */
+ j = i;
+
+ while (i < numberOfAttributes)
+ {
+ TupleDescAttr(desc, i)->attcacheoff = -2;
+ i++;
+ }
+
+ return j;
+}
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index 28dd6de18b..219f837875 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.30.2
v1-0002-Use-specialized-attribute-iterators-in-nbt-_spec..patch (application/octet-stream)
From a6e7e6344b56f47be07edb0775bc77d31ac0a277 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:30:00 +0200
Subject: [PATCH v1 2/6] Use specialized attribute iterators in nbt*_spec.h
code
Split out for making it clear what substantial changes were made to the
pre-existing functions.
---
src/backend/access/nbtree/nbtsearch_spec.h | 16 +++---
src/backend/access/nbtree/nbtsort_spec.h | 24 +++++----
src/backend/access/nbtree/nbtutils_spec.h | 60 +++++++++++++---------
3 files changed, 60 insertions(+), 40 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
index 39b4e6c5ec..e6da30dc73 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.h
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -538,6 +538,7 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -569,23 +570,26 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
index c2f7588914..0db4304835 100644
--- a/src/backend/access/nbtree/nbtsort_spec.h
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -19,8 +19,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -42,7 +41,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -74,22 +73,25 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
}
else if (itup != NULL)
{
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
int32 compare = 0;
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
index 41f606318c..b47a3aaf77 100644
--- a/src/backend/access/nbtree/nbtutils_spec.h
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -47,6 +47,7 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
+ nbts_attiterdeclare(itup);
int i;
itupdesc = RelationGetDescr(rel);
@@ -78,7 +79,10 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -89,27 +93,30 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -671,20 +678,22 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -692,6 +701,7 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -734,24 +744,28 @@ NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft,itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) !=
+ nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
--
2.30.2
v1-0003-Optimize-attribute-iterator-access-for-single-col.patch (application/octet-stream)
From c2637dcca05a0d52184cf92c14529f4bcdd1a5d3 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:47:50 +0200
Subject: [PATCH v1 3/6] Optimize attribute iterator access for single-column
btree keys
This removes the nocache_index_getattr call path, which has significant overhead.
---
src/include/access/nbtree_specialize.h | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index b2ee09621e..2e17a761a0 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -100,8 +100,26 @@
#define nbts_attiter_attnum spec_i
+/*
+ * Simplified (optimized) variant of index_getattr specialized for extracting
+ * only the first attribute: cache offset is guaranteed to be 0, and as such
+ * no cache is required.
+ */
#define nbts_attiter_nextattdatum(itup, tupDesc) \
- index_getattr(itup, 1, tupDesc, &(NBTS_MAKE_NAME(itup, isNull)))
+( \
+ AssertMacro(spec_i == 1), \
+ (IndexTupleHasNulls(itup) && att_isnull(0, (char *)(itup) + sizeof(IndexTupleData))) ? \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull)) = true, \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull) = false), \
+ (Datum) fetchatt(TupleDescAttr((tupDesc), 0), \
+ (char *) (itup) + IndexInfoFindDataOffset((itup)->t_info)) \
+ ) \
+)
#define nbts_attiter_curattisnull(tuple) \
NBTS_MAKE_NAME(tuple, isNull)
--
2.30.2
v1-0001-Specialize-nbtree-functions-on-btree-key-shape.patch (application/octet-stream)
From 5b61e1ccae63c277aed7c852b9679a6cd2fde779 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 30 Jan 2022 16:23:31 +0100
Subject: [PATCH v1 1/6] Specialize nbtree functions on btree key shape
nbtree keys are not all made the same, so a significant amount of time is
spent on code that exists only to deal with other key shapes. By specializing
function calls based on the key shape, we can remove or reduce these causes
of overhead.
This commit adds the basic infrastructure for specializing specific hot code
in the nbtree AM to certain shapes of keys, initially splitting out
(not yet: specializing) the single-key-column case and the attcacheoff-capable
case.
Note that we generate N specialized functions and 1 'default' function for each
specializable function.
This feature can be disabled by defining NBTS_DISABLED.
---
src/backend/access/nbtree/README | 22 +
src/backend/access/nbtree/nbtdedup.c | 300 +------
src/backend/access/nbtree/nbtdedup_spec.h | 303 +++++++
src/backend/access/nbtree/nbtinsert.c | 572 +-----------
src/backend/access/nbtree/nbtinsert_spec.h | 560 ++++++++++++
src/backend/access/nbtree/nbtpage.c | 4 +-
src/backend/access/nbtree/nbtree.c | 31 +-
src/backend/access/nbtree/nbtree_spec.h | 42 +
src/backend/access/nbtree/nbtsearch.c | 994 +--------------------
src/backend/access/nbtree/nbtsearch_spec.h | 985 ++++++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 271 +-----
src/backend/access/nbtree/nbtsort_spec.h | 265 ++++++
src/backend/access/nbtree/nbtsplitloc.c | 14 +-
src/backend/access/nbtree/nbtutils.c | 755 +---------------
src/backend/access/nbtree/nbtutils_spec.h | 762 ++++++++++++++++
src/backend/utils/sort/tuplesort.c | 4 +-
src/include/access/nbtree.h | 95 +-
src/include/access/nbtree_specialize.h | 238 +++++
src/include/access/nbtree_specialized.h | 67 ++
19 files changed, 3369 insertions(+), 2915 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.h
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.h
create mode 100644 src/backend/access/nbtree/nbtree_spec.h
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.h
create mode 100644 src/backend/access/nbtree/nbtsort_spec.h
create mode 100644 src/backend/access/nbtree/nbtutils_spec.h
create mode 100644 src/include/access/nbtree_specialize.h
create mode 100644 src/include/access/nbtree_specialized.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 5529afc1fe..3c08888c23 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1041,6 +1041,28 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+
+Notes about nbtree call specialization
+--------------------------------------
+
+Attribute iteration is a significant overhead for multi-column indexes.
+We can avoid it by specializing performance-sensitive search functions
+and calling those selectively. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths. This performance benefit is at the cost
+of binary size, so this feature can be disabled by defining NBTS_DISABLED.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - single-column indexes
+ NB: The code paths of this optimization do not support multiple key columns.
+ - multi-column indexes that could benefit from the attcacheoff optimization
+ NB: This is also used for the default case, and is slow for uncacheable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 3e11805293..d7025d8e1c 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,259 +22,16 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple, and actually update the page. Else
- * reset the state and move on without modifying the page.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
/*
* Perform bottom-up index deletion pass.
@@ -373,7 +130,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
- else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ else if (nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
@@ -748,55 +505,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
diff --git a/src/backend/access/nbtree/nbtdedup_spec.h b/src/backend/access/nbtree/nbtdedup_spec.h
new file mode 100644
index 0000000000..06fb89ccd1
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.h
@@ -0,0 +1,303 @@
+/*
+ * Specialized functions included in nbtdedup.c
+ */
+
+static bool NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = nbts_call(_bt_do_singleval, rel, page, state,
+ minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and actually update the page. Else
+ * reset the state and move on without modifying the page.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f6f4af8bfe..ec6c73d1cc 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,18 +30,13 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
-static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_stepright(Relation rel,
+ BTInsertState insertstate,
+ BTStack stack);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -73,311 +68,10 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
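+/*
+ * Generate specialized variants of the insertion functions that were moved
+ * into nbtinsert_spec.h, one for each supported key shape.
+ */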
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
- NULL);
-}
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -438,7 +132,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = nbts_call(_bt_binsrch_insert, rel, insertstate);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -483,7 +177,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(nbts_call(_bt_compare, rel, itup_key, page, offset) < 0);
break;
}
@@ -508,7 +202,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (nbts_call(_bt_compare, rel, itup_key, page, offset) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -722,7 +416,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -769,246 +463,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
-
- newitemoff = _bt_binsrch_insert(rel, insertstate);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1649,7 +1103,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lastleft = nposting;
}
- lefthighkey = _bt_truncate(rel, lastleft, firstright, itup_key);
+ lefthighkey = nbts_call(_bt_truncate, rel, lastleft, firstright, itup_key);
itemsz = IndexTupleSize(lefthighkey);
}
else
@@ -2764,8 +2218,8 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
/* Perform deduplication pass (when enabled and index-is-allequalimage) */
if (BTGetDeduplicateItems(rel) && itup_key->allequalimage)
- _bt_dedup_pass(rel, buffer, heapRel, insertstate->itup,
- insertstate->itemsz, (indexUnchanged || uniquedup));
+ nbts_call(_bt_dedup_pass, rel, buffer, heapRel, insertstate->itup,
+ insertstate->itemsz, (indexUnchanged || uniquedup));
}
/*
diff --git a/src/backend/access/nbtree/nbtinsert_spec.h b/src/backend/access/nbtree/nbtinsert_spec.h
new file mode 100644
index 0000000000..80684bde81
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.h
@@ -0,0 +1,560 @@
+/*
+ * Specialized functions for nbtinsert.c
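+ *
+ * This file is included through nbtree_specialize.h (via NBT_SPECIALIZE_FILE);
+ * NBTS_FUNCTION() expands to the name of the key shape-specialized variant
+ * being generated, and callers dispatch to it with nbts_call().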
+ */
+
+static BTStack NBTS_FUNCTION(_bt_search_insert)(Relation rel,
+ BTInsertState insertstate);
+
+static OffsetNumber NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = nbts_call(_bt_mkscankey, rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = nbts_call(_bt_search_insert, rel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = nbts_call(_bt_findinsertloc, rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+ itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ nbts_call(_bt_compare, rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return nbts_call(_bt_search, rel, insertstate->itup_key,
+ &insertstate->buf, BT_WRITE, NULL);
+}
+
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0)
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ Assert(P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0);
+
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 20adb602a4..e66299ebd8 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1967,10 +1967,10 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, targetkey);
+ itup_key = nbts_call(_bt_mkscankey, rel, targetkey);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
- stack = _bt_search(rel, itup_key, &sleafbuf, BT_READ, NULL);
+ stack = nbts_call(_bt_search, rel, itup_key, &sleafbuf, BT_READ, NULL);
/* won't need a second lock or pin on leafbuf */
_bt_relbuf(rel, sleafbuf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 06131f23d4..09c43eb226 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,10 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
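+/*
+ * Generate specialized variants of btinsert() and _bt_specialize(), as
+ * defined in nbtree_spec.h.
+ */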
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -178,33 +182,6 @@ btbuildempty(Relation index)
smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
diff --git a/src/backend/access/nbtree/nbtree_spec.h b/src/backend/access/nbtree/nbtree_spec.h
new file mode 100644
index 0000000000..2e9190f267
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.h
@@ -0,0 +1,42 @@
+/*
+ * Specialized functions for nbtree.c
+ */
+
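+/*
+ * _bt_specialize() -- Install the aminsert variant that is specialized for
+ *					   this index relation's key shape.
+ */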
+void
+NBTS_FUNCTION(_bt_specialize)(Relation rel)
+{
+	rel->rd_indam->aminsert = NBTS_FUNCTION(btinsert);
+}
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+#ifdef NBT_SPEC_DEFAULT
+	/*
+	 * Unspecialized entry point: install the aminsert variant that is
+	 * specialized for this index's key shape, then dispatch to it.
+	 */
+	nbts_call(_bt_specialize, rel);
+	return nbts_call(btinsert, rel, values, isnull, ht_ctid, heapRel,
+					 checkUnique, indexUnchanged, indexInfo);
+#else
+	bool		result;
+	IndexTuple	itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = nbts_call(_bt_doinsert, rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+#endif
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c74543bfde..e81eee9c35 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,11 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -46,6 +43,9 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
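+/*
+ * Generate specialized variants of the search functions that were moved
+ * into nbtsearch_spec.h, one for each supported key shape.
+ */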
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* _bt_drop_lock_and_maybe_pin()
@@ -70,493 +70,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- */
-BTStack
-_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
- Snapshot snapshot)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- */
-Buffer
-_bt_moveright(Relation rel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- Snapshot snapshot)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- page = BufferGetPage(buf);
- TestForOldSnapshot(snapshot, rel, page);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- break;
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
- {
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- break;
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- high = mid;
- }
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- {
- high = mid;
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
/*----------
* _bt_binsrch_posting() -- posting list binary search.
@@ -625,217 +138,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- return result;
-
- scankey++;
- }
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
-
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -1363,7 +665,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
- stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
+ stack = nbts_call(_bt_search, rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
@@ -1392,7 +694,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = nbts_call(_bt_binsrch, rel, &inskey, buf);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1422,9 +724,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, offnum))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, offnum))
{
- /*
+ /*
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
@@ -1498,280 +800,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2014,7 +1042,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation,
+ scan, dir, P_FIRSTDATAKEY(opaque)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2116,7 +1145,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation, scan,
+ dir, PageGetMaxOffsetNumber(page)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2448,7 +1478,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, start))
{
/*
* There's no actually-matching data on this page. Try to advance to
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
new file mode 100644
index 0000000000..39b4e6c5ec
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -0,0 +1,985 @@
+/*
+ * Specialized functions for nbtsearch.c
+ */
+
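+/*
+ * These functions were moved here from nbtsearch.c.  Their definitions are
+ * wrapped in NBTS_FUNCTION() and calls to other specializable functions go
+ * through nbts_call()/nbts_call_norel(), allowing the specialization
+ * machinery in nbtree_specialize.h to generate a variant of each function
+ * for each supported key shape and to route callers to the matching variant.
+ */
+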
+static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel, BTScanInsert key,
+ Buffer buf);
+static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ */
+BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
+ int access, Snapshot snapshot)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP,
+ (access == BT_WRITE), stack_in,
+ page_access, snapshot);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = nbts_call(_bt_binsrch, rel, key, *bufP);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is lowest non-leaf page level prior to leaves. So, if
+ * we're on the level 1 and asked to lock leaf page in write mode,
+ * then lock next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP, true, stack_in,
+ BT_WRITE, snapshot);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split
+ * is completed. 'stack' is only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ */
+Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ Snapshot snapshot)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ break;
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY) >= cmpval)
+ {
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ break;
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
+
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_binsrch)(Relation rel,
+ BTScanInsert key,
+ Buffer buf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
+
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ {
+ high = mid;
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
+
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+NBTS_FUNCTION(_bt_compare)(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+ scankey = key->scankeys;
+ for (int i = 1; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ return result;
+
+ scankey++;
+ }
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
+
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* allow next page be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = nbts_call(_bt_checkkeys, scan->indexRelation,
+ scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c074513efa..762921e66a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,9 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* btbuild() -- build a new btree index.
@@ -566,7 +567,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+ wstate.inskey = nbts_call(_bt_mkscankey, wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
@@ -578,7 +579,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
PROGRESS_BTREE_PHASE_LEAF_LOAD);
- _bt_load(&wstate, btspool, btspool2);
+ nbts_call_norel(_bt_load, wstate.index, &wstate, btspool, btspool2);
}
/*
@@ -978,8 +979,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
lastleft = (IndexTuple) PageGetItem(opage, ii);
Assert(IndexTupleSize(oitup) > last_truncextra);
- truncated = _bt_truncate(wstate->index, lastleft, oitup,
- wstate->inskey);
+ truncated = nbts_call(_bt_truncate, wstate->index, lastleft, oitup,
+ wstate->inskey);
if (!PageIndexTupleOverwrite(opage, P_HIKEY, (Item) truncated,
IndexTupleSize(truncated)))
elog(ERROR, "failed to add high key to the index page");
@@ -1176,264 +1177,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
new file mode 100644
index 0000000000..c2f7588914
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -0,0 +1,265 @@
+/*
+ * Specialized functions included in nbtsort.c
+ */
+
+static void NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2);
+
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (nbts_call(_bt_keep_natts_fast, wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
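
As a quick sanity check on the maxpostingsize cap set in _bt_load() above,
assuming the default 8kB BLCKSZ, 8-byte MAXALIGN and a 4-byte ItemIdData
(the usual values; all three are build/platform dependent):

	BLCKSZ * 10 / 100          = 8192 * 10 / 100 = 819
	MAXALIGN_DOWN(819)         = 816
	816 - sizeof(ItemIdData)   = 816 - 4        = 812 bytes

so each posting list tuple is limited to a bit under 1/10 of a page, as the
comment says.
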
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index ee01ceafda..762a64714a 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -693,7 +693,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
{
itemid = PageGetItemId(state->origpage, maxoff);
tup = (IndexTuple) PageGetItem(state->origpage, itemid);
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -724,7 +724,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -973,7 +973,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* avoid appending a heap TID in new high key, we're done. Finish split
* with default strategy and initial split interval.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
return perfectpenalty;
@@ -994,7 +994,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* If page is entirely full of duplicates, a single value strategy split
* will be performed.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
{
*strategy = SPLIT_MANY_DUPLICATES;
@@ -1032,8 +1032,8 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
itemid = PageGetItemId(state->origpage, P_HIKEY);
hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
- perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
- state->newitem);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, hikey,
+ state->newitem);
if (perfectpenalty <= indnkeyatts)
*strategy = SPLIT_SINGLE_VALUE;
else
@@ -1155,7 +1155,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
lastleft = _bt_split_lastleft(state, split);
firstright = _bt_split_firstright(state, split);
- return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+ return nbts_call(_bt_keep_natts_fast, state->rel, lastleft, firstright);
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 96c72fc432..02faf922ce 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,11 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1221,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2175,286 +1706,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
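
The row-comparison logic that moves into nbtutils_spec.h below is easier to
follow with a toy example of the underlying reduction: compare column by
column, stop at the first unequal column, and let that "deciding" column's
strategy determine pass/fail (and its required-key flags determine
continuescan). The following standalone sketch uses made-up names
(row_compare, deciding_col) and plain ints instead of Datums:

#include <stdio.h>

/* three-way compare of tuple against bound, column by column */
static int
row_compare(const int *tuple, const int *bound, int ncols, int *deciding_col)
{
	int			cmp = 0;

	for (int i = 0; i < ncols; i++)
	{
		*deciding_col = i + 1;
		cmp = (tuple[i] > bound[i]) - (tuple[i] < bound[i]);
		if (cmp != 0)
			break;				/* done comparing if unequal */
	}
	return cmp;					/* 0 means all columns were equal */
}

int
main(void)
{
	int			tuple[] = {1, 5};
	int			bound[] = {1, 2};
	int			col;
	int			cmp = row_compare(tuple, bound, 2, &col);

	/* (1,5) vs (1,2): column 1 ties, column 2 decides with cmp > 0 */
	printf("cmp=%d, deciding column=%d, passes >=: %s\n",
		   cmp, col, cmp >= 0 ? "yes" : "no");
	return 0;
}

Just as in _bt_check_rowcompare(), only the deciding column's strategy
matters for the final result, and only that column's required flags matter
when clearing *continuescan.
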
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
new file mode 100644
index 0000000000..41f606318c
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -0,0 +1,762 @@
+/*
+ * Specialized functions included in nbtutils.c
+ */
+
+static bool NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+
+static int NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (nbts_call_norel(_bt_check_rowcompare, rel, key, tuple,
+ tupnatts, tupdesc, dir, continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey, IndexTuple tuple,
+ int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = nbts_call(_bt_keep_natts, rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == nbts_call(_bt_keep_natts_fast, rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
+ IndexTuple lastleft,
+ IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 361527098f..a978fb7f98 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1097,7 +1097,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = nbts_call(_bt_mkscankey, indexRel, NULL);
if (state->indexInfo->ii_Expressions != NULL)
{
@@ -1194,7 +1194,7 @@ tuplesort_begin_index_btree(Relation heapRel,
state->enforceUnique = enforceUnique;
state->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = nbts_call(_bt_mkscankey, indexRel, NULL);
/* Prepare SortSupport data for each column */
state->sortKeys = (SortSupport) palloc0(state->nKeys *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 93f8267b48..a758748635 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1116,15 +1116,56 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_SINGLE_COLUMN single
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+
+
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
+
+#define NBTS_ENABLED
+
+#ifdef NBTS_ENABLED
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) \
+( \
+ (rel)->rd_index->indnkeyatts == 1 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
+)
+
+#else /* not defined NBTS_ENABLED */
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
+
+#endif /* NBTS_ENABLED */
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specialized.h"
+#include "nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
+// see nbtree_specialized.h
+// extern bool btinsert(Relation rel, Datum *values, bool *isnull,
+// ItemPointer ht_ctid, Relation heapRel,
+// IndexUniqueCheck checkUnique,
+// bool indexUnchanged,
+// struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1155,9 +1196,10 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
- IndexTuple newitem, Size newitemsz,
- bool bottomupdedup);
+// specialized
+//extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
+// IndexTuple newitem, Size newitemsz,
+// bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1173,9 +1215,10 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
+// see nbtree_specialized.h
+// extern bool _bt_doinsert(Relation rel, IndexTuple itup,
+// IndexUniqueCheck checkUnique, bool indexUnchanged,
+// Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
@@ -1223,12 +1266,13 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
- int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+// see nbtree_specialized.h
+// extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+// int access, Snapshot snapshot);
+// extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+// bool forupdate, BTStack stack, int access, Snapshot snapshot);
+// extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
+// extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -1237,7 +1281,8 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
+// see nbtree_specialized.h
+// extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1245,8 +1290,9 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
+// see nbtree_specialized.h
+//extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
+// int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1259,10 +1305,11 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
+// see nbtree_specialized.h
+// extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+// IndexTuple firstright, BTScanInsert itup_key);
+// extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+// IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
new file mode 100644
index 0000000000..b2ee09621e
--- /dev/null
+++ b/src/include/access/nbtree_specialize.h
@@ -0,0 +1,238 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_specialize.h
+ * header file for postgres btree access method implementation.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_specialize.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_call(function, indexrel, ...rest_of_args), and
+ * nbts_call_norel(function, indexrel, ...args)
+ * This will call the specialized variant of 'function' based on the index
+ * relation data.
+ * The difference between nbts_call and nbts_call_norel is that _call
+ * uses indexrel as first argument in the function call, whereas
+ * nbts_call_norel does not.
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ *
+ * example usage:
+ *
+ * kwithnulls = nbts_call_norel(_bt_key_hasnulls, myindex, mytuple, tupDesc);
+ *
+ * NBTS_FUNCTION(_bt_key_hasnulls)(IndexTuple mytuple, TupleDesc tupDesc)
+ * {
+ * nbts_attiterdeclare(mytuple);
+ * nbts_attiterinit(mytuple, 1, tupDesc);
+ * nbts_foreachattr(1, 10)
+ * {
+ * Datum it = nbts_attiter_nextattdatum(mytuple, tupDesc);
+ * if (nbts_attiter_curattisnull(mytuple))
+ * return true;
+ * }
+ * return false;
+ * }
+ */
+
+/*
+ * Call a potentially specialized function for a given btree operation.
+ *
+ * NB: the rel argument is evaluated multiple times.
+ */
+#define nbts_call(name, rel, ...) \
+ nbts_call_norel(name, rel, rel, __VA_ARGS__)
+
+#ifdef NBTS_ENABLED
+
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+#ifdef nbts_call_norel
+#undef nbts_call_norel
+#endif
+
+#define nbts_call_norel(name, rel, ...) \
+ (NBTS_FUNCTION(name)(__VA_ARGS__))
+
+/*
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path may never be used for indexes with multiple key
+ * columns, because it does not ever continue to a next column.
+ */
+
+#define NBTS_SPECIALIZING_SINGLE_COLUMN
+#define NBTS_TYPE NBTS_TYPE_SINGLE_COLUMN
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull);
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert(endAttNum <= 1); \
+ for (int spec_i = endAttNum; spec_i == 1; spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr(itup, 1, tupDesc, &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(tuple) \
+ NBTS_MAKE_NAME(tuple, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_COLUMN
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * Multiple key columns, optimized access for attcacheoff-cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull);
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr(itup, spec_i, tupDesc, &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_CACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* reset call to SPECIALIZE_CALL for default behaviour */
+#undef nbts_call_norel
+#define nbts_call_norel(name, rel, ...) \
+ NBT_SPECIALIZE_CALL(name, rel, __VA_ARGS__)
+
+/*
+ * "Default", externally accessible, not so much optimized functions
+ */
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+// for the default functions, we want to use the unspecialized name.
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) name
+
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull);
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr(itup, spec_i, tupDesc, &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(tuple) \
+ NBTS_MAKE_NAME(tuple, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* from here on there are no more NBTS_FUNCTIONs */
+#undef NBTS_FUNCTION
+
+#else // not defined NBTS_ENABLED
+
+// NBTS_ENABLED is not defined, so we don't want to use the specializations.
+// We revert to the behaviour from PG14 and earlier, which only uses
+// attcacheoff.
+
+#define NBTS_FUNCTION(name) name
+
+#define nbts_call_specialized(name, ...) \
+ NBTS_FUNCTION(name)(__VA_ARGS__)
+
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull);
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr(itup, spec_i, tupDesc, &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(tuple) \
+ NBTS_MAKE_NAME(tuple, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+#define nbts_call_norel(name, rel, ...) \
+ NBT_SPECIALIZE_CALL(name, rel, __VA_ARGS__)
+
+#endif // NBTS_ENABLED
diff --git a/src/include/access/nbtree_specialized.h b/src/include/access/nbtree_specialized.h
new file mode 100644
index 0000000000..2e835b8e4f
--- /dev/null
+++ b/src/include/access/nbtree_specialized.h
@@ -0,0 +1,67 @@
+//
+// Created by matthias on 27/07/2021.
+//
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_specialize)(Relation rel);
+
+extern bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key,
+ Buffer *bufP, int access,
+ Snapshot snapshot);
+extern Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel, BTScanInsert key, Buffer buf,
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot);
+extern OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate);
+extern int32
+NBTS_FUNCTION(_bt_compare)(Relation rel, BTScanInsert key,
+ Page page, OffsetNumber offnum);
+
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup);
+extern bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
--
2.30.2
v1-0006-Specialize-the-nbtree-rd_indam-entry.patch
From b6e6b651af17cbf36192ab1f3f658968e94e172c Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:54:52 +0200
Subject: [PATCH v1 6/6] Specialize the nbtree rd_indam entry.
Because each rd_indam struct is separately allocated for each index, we can
freely modify it at runtime without impacting other indexes of the same
access method. For btinsert (which effectively only calls _bt_insert) it is
useful to specialize that function, which also makes rd_indam->aminsert a
good signal whether or not the indexRelation has been fully optimized yet.
---
src/backend/access/nbtree/nbtree.c | 7 +++++++
src/backend/access/nbtree/nbtree_spec.h | 20 +++++++++++++++-----
src/backend/access/nbtree/nbtsearch.c | 2 ++
src/backend/access/nbtree/nbtsort.c | 2 ++
src/include/access/nbtree.h | 7 +++++++
5 files changed, 33 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 09c43eb226..95da2c46bf 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,6 +161,8 @@ btbuildempty(Relation index)
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+ nbt_opt_specialize(index);
+
/*
* Write the page and log it. It might seem that an immediate sync would
* be sufficient to guarantee that the file exists on disk, but recovery
@@ -323,6 +325,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -765,6 +769,7 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(info->index);
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
@@ -798,6 +803,8 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (info->analyze_only)
return stats;
+ nbt_opt_specialize(info->index);
+
/*
* If btbulkdelete was called, we need not do anything (we just maintain
* the information used within _bt_vacuum_needs_cleanup() by calling
diff --git a/src/backend/access/nbtree/nbtree_spec.h b/src/backend/access/nbtree/nbtree_spec.h
index 2e9190f267..001e56bfb8 100644
--- a/src/backend/access/nbtree/nbtree_spec.h
+++ b/src/backend/access/nbtree/nbtree_spec.h
@@ -2,9 +2,18 @@
* Specialized functions for nbtree.c
*/
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
void
-NBTS_FUNCTION(_bt_specialize)(Relation rel) {
+NBTS_FUNCTION(_bt_specialize)(Relation rel)
+{
+ PopulateTupleDescCacheOffsets(rel->rd_att);
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbts_call_norel(_bt_specialize, rel, rel);
+#else
rel->rd_indam->aminsert = NBTS_FUNCTION(btinsert);
+#endif
}
/*
@@ -23,10 +32,11 @@ NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
bool result;
IndexTuple itup;
-#ifdef NBT_SPEC_DEFAULT
- nbts_call(_bt_specialize, rel);
- nbts_call(_bt_insert, rel, values, isnull, ht_ctid, heapRel, checkUnique,
- indexUnchanged, indexInfo);
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbt_opt_specialize(rel);
+
+ return nbts_call(btinsert, rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexUnchanged, indexInfo);
#else
/* generate an index tuple */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e81eee9c35..d5152bfcb7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -181,6 +181,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(!BTScanPosIsValid(so->currPos));
+ nbt_opt_specialize(scan->indexRelation);
+
pgstat_count_index_scan(rel);
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 762921e66a..f2311caf16 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -305,6 +305,8 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
BTBuildState buildstate;
double reltuples;
+ nbt_opt_specialize(index);
+
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
ResetUsage();
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d59531f3b3..e101797419 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1134,6 +1134,12 @@ typedef struct BTOptions
#ifdef NBTS_ENABLED
+#define nbt_opt_specialize(rel) \
+do { \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert)) \
+ _bt_specialize(rel); \
+} while (false)
+
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
@@ -1154,6 +1160,7 @@ typedef struct BTOptions
#else /* not defined NBTS_ENABLED */
+#define nbt_opt_specialize(rel)
#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
#endif /* NBTS_ENABLED */
--
2.30.2
On Fri, Apr 8, 2022 at 9:55 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Here's generation 2 of this effort. Instead of proceeding to trying to
shoehorn all types of btree keys into one common code path, this
patchset acknowledges that there exist different shapes of keys that
each have a different best way of accessing each subsequent key
attribute. This patch achieves this by specializing the functions to
different key shapes.
Cool.
These patches still have some rough edges (specifically: some
functions that are being generated are unused, and intermediate
patches don't compile), but I wanted to get this out to get some
feedback on this approach.
I attempted to apply your patch series to get some general sense of
how it affects performance, by using my own test cases from previous
nbtree project work. I gave up on that pretty quickly, though, since
the code wouldn't compile. That in itself might have been okay (some
"rough edges" are generally okay). The real problem was that it wasn't
clear what I was expected to do about it! You mentioned that some of
the patches just didn't compile, but which ones? How do I quickly get
some idea of the benefits on offer here, however imperfect or
preliminary?
Can you post a version of this that compiles? As a general rule you
should try to post patches that can be "test driven" easily. An
opening paragraph that says "here is why you should care about my
patch" is often something to strive for, too. I suspect that you
actually could have done that here, but you didn't, for whatever
reason.
I expect the performance to be at least on par with current btree
code, and I'll try to publish a more polished patchset with
performance results sometime in the near future. I'll also try to
re-attach dynamic page-level prefix truncation, but that depends on
the amount of time I have and the amount of feedback on this patchset.
Can you give a few motivating examples? You know, queries that are
sped up by the patch series, with an explanation of where the benefit
comes from. You had some on the original thread, but that included
dynamic prefix truncation stuff as well.
Ideally you would also describe where the advertised improvements come
from for each test case -- which patch, which enhancement (perhaps
only in rough terms for now).
--
Peter Geoghegan
On Sun, Apr 10, 2022 at 2:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
Can you post a version of this that compiles?
I forgot to add: the patch also bitrot due to recent commit dbafe127.
I didn't get stuck at this point (this is minor bitrot), but no reason
not to rebase.
--
Peter Geoghegan
On Sun, 10 Apr 2022 at 23:45, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Apr 8, 2022 at 9:55 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Here's generation 2 of this effort. Instead of proceeding to trying to
shoehorn all types of btree keys into one common code path, this
patchset acknowledges that there exist different shapes of keys that
each have a different best way of accessing each subsequent key
attribute. This patch achieves this by specializing the functions to
different key shapes.
Cool.
These patches still have some rough edges (specifically: some
functions that are being generated are unused, and intermediate
patches don't compile), but I wanted to get this out to get some
feedback on this approach.
I attempted to apply your patch series to get some general sense of
how it affects performance, by using my own test cases from previous
nbtree project work. I gave up on that pretty quickly, though, since
the code wouldn't compile. That in itself might have been okay (some
"rough edges" are generally okay). The real problem was that it wasn't
clear what I was expected to do about it! You mentioned that some of
the patches just didn't compile, but which ones? How do I quickly get
some idea of the benefits on offer here, however imperfect or
preliminary?
Can you post a version of this that compiles? As a general rule you
should try to post patches that can be "test driven" easily. An
opening paragraph that says "here is why you should care about my
patch" is often something to strive for, too. I suspect that you
actually could have done that here, but you didn't, for whatever
reason.
Yes, my bad. I pulled one patch out of the set because it included
unrelated changes, but I missed that it also contained some changes
that should have been in an earlier commit, which broke the set.
I'll send an updated patchset soon (I'm planning on moving around when
what is changed/added); but before that the attached incremental patch
should help. FYI, the patchset has been tested on commit 05023a23, and
compiles (with unused function warnings) after applying the attached
patch.
I expect the performance to be at least on par with current btree
code, and I'll try to publish a more polished patchset with
performance results sometime in the near future. I'll also try to
re-attach dynamic page-level prefix truncation, but that depends on
the amount of time I have and the amount of feedback on this patchset.Can you give a few motivating examples? You know, queries that are
sped up by the patch series, with an explanation of where the benefit
comes from. You had some on the original thread, but that included
dynamic prefix truncation stuff as well.
Queries that I expect to be faster are situations where the index does
front-to-back attribute accesses in a tight loop and repeated index
lookups; such as in index builds, data loads, JOINs, or IN () and =
ANY () operations; and then specifically for indexes with only a
single key attribute, or indexes where we can determine based on the
index attributes' types that nocache_index_getattr will be called at
least once for a full _bt_compare call (i.e. att->attcacheoff cannot
be set for at least one key attribute).
Cases where I expect a limited slowdown are hot loops over btree
indexes that do not have a specifically optimized path, because there
is some extra overhead in calling the specialized functions. Other code
might also see a minimal performance impact due to the increased binary
size resulting in more cache thrashing.
Ideally you would also describe where the advertised improvements come
from for each test case -- which patch, which enhancement (perhaps
only in rough terms for now).
In the previous iteration, I discerned that there are different
"shapes" of indexes, some of which currently have significant overhead
in the existing btree infrastructure. Especially indexes with multiple
key attributes can see significant overhead while their attributes are
being extracted, which (for a significant part) can be attributed to
the O(n) behaviour of nocache_index_getattr. This O(n) overhead is
currently avoided only by indexes with only a single key attribute and
by indexes in which all key attributes have a fixed size (att->attlen > 0).
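To make that concrete, here is a small standalone sketch of the
cacheability rule (the struct and function names are invented for
illustration and this is not the actual attcacheoff code; NULLs, which
also break caching, are ignored here):

typedef struct SketchAttr
{
    int     attlen;     /* > 0: fixed width; <= 0: variable width */
    int     cachedoff;  /* precomputed offset, or -1 if unknown */
} SketchAttr;

static void
sketch_populate_offsets(SketchAttr *atts, int natts)
{
    int     off = 0;

    for (int i = 0; i < natts; i++)
    {
        /* this attribute still starts at a fixed offset: cacheable */
        atts[i].cachedoff = off;

        if (atts[i].attlen <= 0)
        {
            /* variable width: every later offset depends on the data */
            for (int j = i + 1; j < natts; j++)
                atts[j].cachedoff = -1;
            return;
        }
        off += atts[i].attlen;
    }
}

With a layout like (varlen, fixed, fixed) only the first attribute gets
a usable offset, which is why the later attributes always end up in
nocache_index_getattr on master.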
The types of btree keys I discerned were:
CREATE INDEX ON tst ...
... (single_key_attribute)
... (varlen, other, attributes, ...)
... (fixed_size, also_fixed, ...)
... (sometimes_null, other, attributes, ...)
For single-key-attribute btrees, the performance benefits in the patch
are achieved by reducing branching in the attribute extraction: There
are no other key attributes to worry about, so much of the code
dealing with looping over attributes can inline values, and thus
reduce the amount of code generated in the hot path.
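For reference, this is roughly what the single-column iterator macros
in nbtree_specialize.h above expand to (hand-expanded and simplified,
so treat it as a sketch rather than literal preprocessor output; itup,
itupdesc and ncmpkey are assumed to come from a caller such as the
specialized _bt_compare):

    bool        itup_isNull;    /* nbts_attiterdeclare(itup) */

    /* nbts_foreachattr(1, ncmpkey) in the single-column specialization */
    Assert(ncmpkey <= 1);
    for (int spec_i = ncmpkey; spec_i == 1; spec_i++)
    {
        /* nbts_attiter_nextattdatum(itup, itupdesc): attnum is the constant 1 */
        Datum       datum = index_getattr(itup, 1, itupdesc, &itup_isNull);

        /* ... compare datum against the (single) scan key ... */
    }

Because the attribute number is a compile-time constant and the first
attribute always starts at a known offset, most of the per-column
branching and loop bookkeeping can be folded away.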
For btrees with multiple key attributes, benefits are achieved if some
attributes are of variable length (e.g. text):
On master, if your index looks like CREATE INDEX ON tst (varlen,
fixed, fixed), for the latter attributes the code will always hit the
slow path of nocache_index_getattr. This introduces a significant
overhead; as that function wants to re-establish that the requested
attribute's offset is indeed not cached and not cacheable, and
calculates the requested attributes' offset in the tuple from
effectively zero. That is usually quite wasteful, as (in btree code,
usually) we'd already calculated the previous attribute's offset just
a moment ago; which should be reusable.
In this patch, the code will use an attribute iterator (as described
and demonstrated in the linked thread) to remove this O(n)
per-attribute overhead and change the worst-case complexity of
iterating over the attributes of such an index tuple from O(n^2) to
O(n).
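To illustrate that difference with a standalone sketch (invented names
again, not the actual tuple or nbtree code):

#include <stddef.h>

typedef struct SketchCol
{
    int     len;    /* > 0: fixed width; otherwise a length byte in the data */
} SketchCol;

static size_t
sketch_col_width(const SketchCol *col, const unsigned char *data)
{
    /* stand-in for real varlena length decoding */
    return (col->len > 0) ? (size_t) col->len : (size_t) (1 + data[0]);
}

/* O(attnum) per call: walk from the start of the tuple every time */
static size_t
sketch_offset_from_scratch(const SketchCol *cols, const unsigned char *tup,
                           int attnum)
{
    size_t  off = 0;

    for (int i = 0; i < attnum - 1; i++)
        off += sketch_col_width(&cols[i], tup + off);
    return off;
}

/* O(1) amortized per attribute: remember where the previous one ended */
typedef struct SketchIter
{
    size_t  off;
    int     attnum;     /* 0-based index of the next attribute */
} SketchIter;

static size_t
sketch_iter_next(SketchIter *it, const SketchCol *cols,
                 const unsigned char *tup)
{
    size_t  this_off = it->off;

    it->off += sketch_col_width(&cols[it->attnum++], tup + this_off);
    return this_off;
}

Visiting all n key attributes through the first form costs O(n^2) in
the worst case; the iterator form, which is the approach the attribute
iterator in this patchset takes, costs O(n) for the whole tuple.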
Null attributes in the key are not yet handled in any special manner
in this patch. That is mainly because it is impossible to statically
determine which attribute is going to be null based on the index
definition alone, and thus doesn't benefit as much from statically
generated optimized paths.
-Matthias
Attachments:
v1-0007-Add_missed_declarations_for__bt_keep_natts.patch
Index: src/backend/access/nbtree/nbtutils_spec.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
--- a/src/backend/access/nbtree/nbtutils_spec.h (revision b6e6b651af17cbf36192ab1f3f658968e94e172c)
+++ b/src/backend/access/nbtree/nbtutils_spec.h (date 1649627408105)
@@ -667,6 +667,8 @@
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
On Sun, Apr 10, 2022 at 4:08 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I'll send an updated patchset soon (I'm planning on moving around when
what is changed/added); but before that the attached incremental patch
should help. FYI, the patchset has been tested on commit 05023a23, and
compiles (with unused function warnings) after applying the attached
patch.
I can get it to work now, with your supplemental patch.
Queries that I expect to be faster are situations where the index does
front-to-back attribute accesses in a tight loop and repeated index
lookups; such as in index builds, data loads, JOINs, or IN () and =
ANY () operations; and then specifically for indexes with only a
single key attribute, or indexes where we can determine based on the
index attributes' types that nocache_index_getattr will be called at
least once for a full _bt_compare call (i.e. att->attcacheoff cannot
be set for at least one key attribute).
I did some quick testing of the patch series -- pretty much just
reusing my old test suite from the Postgres 12 and 13 nbtree work.
This showed that there is a consistent improvement in some cases. It
also failed to demonstrate any performance regressions. That's
definitely a good start.
I saw about a 4% reduction in runtime for the same UK land registry
test that you yourself have run in the past for the same patch series
[1]. I suspect that there just aren't that many ways to get that kind
of speed up with this test case, except perhaps by further compressing
the on-disk representation used by nbtree. My guess is that the patch
reduces the runtime for this particular test case to a level that's
significantly closer to the limit for this particular piece of
silicon. Which is not to be sniffed at.
Admittedly these test cases were chosen purely because they were
convenient. They were originally designed to test space utilization,
which isn't affected either way here. I like writing reproducible test
cases for indexing stuff, and think that it could work well here too
(even though you're not optimizing space utilization at all). A real
test suite that targets a deliberately chosen cross section of "index
shapes" might work very well.
In the previous iteration, I discerned that there are different
"shapes" of indexes, some of which currently have significant overhead
in the existing btree infrastructure. Especially indexes with multiple
key attributes can see significant overhead while their attributes are
being extracted, which (for a significant part) can be attributed to
the O(n) behaviour of nocache_index_getattr. This O(n) overhead is
currently avoided only by indexes with only a single key attribute and
by indexes in which all key attributes have a fixed size (att->attlen > 0).
Good summary.
The types of btree keys I discerned were:
CREATE INDEX ON tst ...
... (single_key_attribute)
... (varlen, other, attributes, ...)
... (fixed_size, also_fixed, ...)
... (sometimes_null, other, attributes, ...)
For single-key-attribute btrees, the performance benefits in the patch
are achieved by reducing branching in the attribute extraction: There
are no other key attributes to worry about, so much of the code
dealing with looping over attributes can inline values, and thus
reduce the amount of code generated in the hot path.
I agree that it might well be useful to bucket indexes into several
different "index shape archetypes" like this. Roughly the same
approach worked well for me in the past. This scheme might turn out to
be reductive, but even then it could still be very useful (all models
are wrong, some are useful, now as ever).
For btrees with multiple key attributes, benefits are achieved if some
attributes are of variable length (e.g. text):
On master, if your index looks like CREATE INDEX ON tst (varlen,
fixed, fixed), for the latter attributes the code will always hit the
slow path of nocache_index_getattr. This introduces a significant
overhead; as that function wants to re-establish that the requested
attribute's offset is indeed not cached and not cacheable, and
calculates the requested attributes' offset in the tuple from
effectively zero.
Right. So this particular index shape seems like something that we
treat in a rather naive way currently.
Can you demonstrate that with a custom test case? (The result I cited
before was from a '(varlen,varlen,varlen)' index, which is important,
but less relevant.)
[1]: /messages/by-id/CAEze2Whwvr8aYcBf0BeBuPy8mJGtwxGvQYA9OGR5eLFh6Q_ZvA@mail.gmail.com
--
Peter Geoghegan
On Mon, 11 Apr 2022 at 03:11, Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Apr 10, 2022 at 4:08 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I'll send an updated patchset soon (I'm planning on moving around when
what is changed/added); but before that the attached incremental patch
should help. FYI, the patchset has been tested on commit 05023a23, and
compiles (with unused function warnings) after applying the attached
patch.
I can get it to work now, with your supplemental patch.
Great. Attached is the updated patchset; the main changes are:
- Rebased on top of 5bb2b6ab
- All patches should compile when built on top of each preceding patch.
- Reordered the patches to a more logical order, and cleaned up the
content of each patch
- Updated code so that GCC doesn't warn about unused code.
- Add a patch for dynamic prefix truncation at page level; ref thread at [1].
- Fixed issues in the specialization macros that caused problems for
the dynamic prefix truncation patch above.
Still to-do:
- Validate performance and share the numbers for the same test indexes
in [1]. I'm planning on doing that next Monday.
- Decide whether / how to keep the NBTS_ENABLED flag. The current
#define in nbtree.h is a bad example of a compile-time configuration
and should be changed (even if we only want to be able to disable
specialization at compile time, it should be moved).
Maybe:
- More tests: PG already extensively tests the btree code while it is
running the test suite - btree is the main index AM - but more tests
might be needed to test the validity of the specialized code.
Queries that I expect to be faster are situations where the index does
front-to-back attribute accesses in a tight loop and repeated index
lookups; such as in index builds, data loads, JOINs, or IN () and =
ANY () operations; and then specifically for indexes with only a
single key attribute, or indexes where we can determine based on the
index attributes' types that nocache_index_getattr will be called at
least once for a full _bt_compare call (i.e. att->attcacheoff cannot
be set for at least one key attribute).
I did some quick testing of the patch series -- pretty much just
reusing my old test suite from the Postgres 12 and 13 nbtree work.
This showed that there is a consistent improvement in some cases. It
also failed to demonstrate any performance regressions. That's
definitely a good start.I saw about a 4% reduction in runtime for the same UK land registry
test that you yourself have run in the past for the same patch series
[1].
That's good to know. The updated patches (as attached) have dynamic
prefix truncation from the patch series in [1] added too, which should
improve the performance by a few more percentage points in that
specific test case.
I suspect that there just aren't that many ways to get that kind
of speed up with this test case, except perhaps by further compressing
the on-disk representation used by nbtree. My guess is that the patch
reduces the runtime for this particular test case to a level that's
significantly closer to the limit for this particular piece of
silicon. Which is not to be sniffed at.
Admittedly these test cases were chosen purely because they were
convenient. They were originally designed to test space utilization,
which isn't affected either way here. I like writing reproducible test
cases for indexing stuff, and think that it could work well here too
(even though you're not optimizing space utilization at all). A real
test suite that targets a deliberately chosen cross section of "index
shapes" might work very well.
I'm not sure what you're referring to. Is the set of indexes I used in
[1]?
In the previous iteration, I discerned that there are different
"shapes" of indexes, some of which currently have significant overhead
in the existing btree infrastructure. Especially indexes with multiple
key attributes can see significant overhead while their attributes are
being extracted, which (for a significant part) can be attributed to
the O(n) behaviour of nocache_index_getattr. This O(n) overhead is
currently avoided only by indexes with only a single key attribute and
by indexes in which all key attributes have a fixed size (att->attlen > 0).
Good summary.
The types of btree keys I discerned were:
CREATE INDEX ON tst ...
... (single_key_attribute)
... (varlen, other, attributes, ...)
... (fixed_size, also_fixed, ...)
... (sometimes_null, other, attributes, ...)
For single-key-attribute btrees, the performance benefits in the patch
are achieved by reducing branching in the attribute extraction: There
are no other key attributes to worry about, so much of the code
dealing with looping over attributes can inline values, and thus
reduce the amount of code generated in the hot path.
I agree that it might well be useful to bucket indexes into several
different "index shape archetypes" like this. Roughly the same
approach worked well for me in the past. This scheme might turn out to
be reductive, but even then it could still be very useful (all models
are wrong, some are useful, now as ever).
For btrees with multiple key attributes, benefits are achieved if some
attributes are of variable length (e.g. text):
On master, if your index looks like CREATE INDEX ON tst (varlen,
fixed, fixed), for the latter attributes the code will always hit the
slow path of nocache_index_getattr. This introduces a significant
overhead; as that function wants to re-establish that the requested
attribute's offset is indeed not cached and not cacheable, and
calculates the requested attributes' offset in the tuple from
effectively zero.
Right. So this particular index shape seems like something that we
treat in a rather naive way currently.
But really every index shape is treated naively, except the cacheable
index shapes. The main reason we haven't cared about it much is that
you don't often see btrees with many key attributes, and when such an
index is slow that tends to be explained away with 'it is a big index
and a wide index key'; the index still saves orders of magnitude over
a table scan, so people generally don't complain about it. A notable
exception was 80b9e9c4, where a customer complained about index scans
being faster than index-only scans.
Can you demonstrate that with a custom test case? (The result I cited
before was from a '(varlen,varlen,varlen)' index, which is important,
but less relevant.)
Anything that has a variable length in any attribute other than the
last; so that includes (varlen, int) and also (int, int, varlen, int,
int, int, int).
The catalogs currently seem to include only one such index:
pg_proc_proname_args_nsp_index is an index on pg_proc (name (const),
oidvector (varlen), oid (const)).
- Matthias
[1]: /messages/by-id/CAEze2WhyBT2bKZRdj_U0KS2Sbewa1XoO_BzgpzLC09sa5LUROg@mail.gmail.com
Attachments:
v2-0002-Use-specialized-attribute-iterators-in-backend-nb.patch
From 8cc1ea41353ba0d0d69a6383c75ed2663608d609 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:30:00 +0200
Subject: [PATCH v2 2/7] Use specialized attribute iterators in
backend/*/nbt*_spec.h
Split out for making it clear what substantial changes were made to the
pre-existing functions.
Even though not all nbt*_spec functions have been updated; most call sites
now can directly call the specialized functions instead of having to determine
the right specialization based on the (potentially locally unavailable) index
relation, making the specialization of those functions worth the effort.
---
src/backend/access/nbtree/nbtsearch_spec.h | 16 +++---
src/backend/access/nbtree/nbtsort_spec.h | 24 +++++----
src/backend/access/nbtree/nbtutils_spec.h | 63 +++++++++++++---------
3 files changed, 62 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
index 73d5370496..a5c5f2b94f 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.h
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -823,6 +823,7 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -854,23 +855,26 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
index 8f4a3602ca..d3f2db2dc4 100644
--- a/src/backend/access/nbtree/nbtsort_spec.h
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -27,8 +27,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -50,7 +49,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -82,22 +81,25 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
}
else if (itup != NULL)
{
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
int32 compare = 0;
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
index a4b934ae7a..638eff18f6 100644
--- a/src/backend/access/nbtree/nbtutils_spec.h
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -211,6 +211,8 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
@@ -222,20 +224,22 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -243,6 +247,7 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -295,7 +300,7 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
- int i;
+ nbts_attiterdeclare(itup);
itupdesc = RelationGetDescr(rel);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -326,7 +331,10 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -337,27 +345,30 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -744,24 +755,28 @@ NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft,itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) !=
+ nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
--
2.30.2
v2-0005-Add-a-function-whose-task-it-is-to-populate-all-a.patch (application/x-patch)
From 7cd7335a8e0c0540835a738ad92dc6c214ebe566 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:51:01 +0200
Subject: [PATCH v2 5/7] Add a function whose task it is to populate all
attcacheoff-s of a TupleDesc's attributes
It fills uncacheable offsets with -2 (as opposed to -1, which signals
"unknown"), allowing users of the API to determine the cacheability of
an attribute in O(1) after this one-time O(n) cost, instead of the
repeated O(n) cost that currently applies.
---
src/backend/access/common/tupdesc.c | 97 +++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 99 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 9f41b1e854..5630fc9da0 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -910,3 +910,100 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc, returning the last
+ * attcacheoff with a valid value.
+ *
+ * Sets attcacheoff to -2 for uncacheable attributes (i.e. attributes after a
+ * variable-length attribute).
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber i, j;
+
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ {
+ /*
+ * Already done the calculations, find the last attribute that has
+ * cache offset.
+ */
+ for (i = (AttrNumber) numberOfAttributes; i > 1; i--)
+ {
+ if (TupleDescAttr(desc, i - 1)->attcacheoff != -2)
+ return i;
+ }
+
+ return 1;
+ }
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ i = 1;
+ /*
+ * Someone might have set some offsets previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (i < numberOfAttributes && TupleDescAttr(desc, i)->attcacheoff > 0)
+ i++;
+
+ /* Cache offset is undetermined. Start calculating offsets if possible */
+ if (i < numberOfAttributes &&
+ TupleDescAttr(desc, i)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, i - 1);
+ Size off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (i < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, i);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ i++;
+ }
+ } else {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ }
+ }
+
+ /*
+ * No cacheable offsets left. Fill the rest with -2s, but return the latest
+ * cached offset.
+ */
+ j = i;
+
+ while (i < numberOfAttributes)
+ {
+ TupleDescAttr(desc, i)->attcacheoff = -2;
+ i++;
+ }
+
+ return j;
+}
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index 28dd6de18b..219f837875 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.30.2
v2-0003-Specialize-the-nbtree-rd_indam-entry.patch (application/x-patch)
From eb05201ec8f207aec3c106813424477b9ab3c454 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:54:52 +0200
Subject: [PATCH v2 3/7] Specialize the nbtree rd_indam entry.
Because each rd_indam struct is allocated separately for each index, we can
freely modify it at runtime without impacting other indexes of the same
access method. For btinsert (which effectively only calls _bt_insert) it is
useful to specialize that function, which also makes rd_indam->aminsert a
good signal for whether the indexRelation has been fully optimized yet.
---
src/backend/access/nbtree/nbtree.c | 7 +++++++
src/backend/access/nbtree/nbtsearch.c | 2 ++
src/backend/access/nbtree/nbtsort.c | 2 ++
src/include/access/nbtree.h | 14 ++++++++++++++
4 files changed, 25 insertions(+)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 09c43eb226..95da2c46bf 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,6 +161,8 @@ btbuildempty(Relation index)
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+ nbt_opt_specialize(index);
+
/*
* Write the page and log it. It might seem that an immediate sync would
* be sufficient to guarantee that the file exists on disk, but recovery
@@ -323,6 +325,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -765,6 +769,7 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(info->index);
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
@@ -798,6 +803,8 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (info->analyze_only)
return stats;
+ nbt_opt_specialize(info->index);
+
/*
* If btbulkdelete was called, we need not do anything (we just maintain
* the information used within _bt_vacuum_needs_cleanup() by calling
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e81eee9c35..d5152bfcb7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -181,6 +181,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(!BTScanPosIsValid(so->currPos));
+ nbt_opt_specialize(scan->indexRelation);
+
pgstat_count_index_scan(rel);
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f1d146ba71..22c7163197 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -305,6 +305,8 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
BTBuildState buildstate;
double reltuples;
+ nbt_opt_specialize(index);
+
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
ResetUsage();
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0dbab16..489b623663 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1132,6 +1132,19 @@ typedef struct BTOptions
#ifdef NBTS_ENABLED
+/*
+ * Replace the functions in the rd_indam struct with a variant optimized for
+ * our key shape, if not already done.
+ *
+ * It only needs to be done once for each index relation that gets loaded, so
+ * this branch is quite unlikely to be taken and is therefore marked unlikely().
+ */
+#define nbt_opt_specialize(rel) \
+do { \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert)) \
+ _bt_specialize(rel); \
+} while (false)
+
/*
* Access a specialized nbtree function, based on the shape of the index key.
*/
@@ -1143,6 +1156,7 @@ typedef struct BTOptions
#else /* not defined NBTS_ENABLED */
+#define nbt_opt_specialize(rel)
#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
#endif /* NBTS_ENABLED */
--
2.30.2
v2-0004-Optimize-attribute-iterator-access-for-single-col.patch (application/x-patch)
From 38f619c26422ff8b58f6f70478ba110401f53ee0 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:47:50 +0200
Subject: [PATCH v2 4/7] Optimize attribute iterator access for single-column
btree keys
This removes the nocache_index_getattr call path, which has significant overhead.
---
src/include/access/nbtree.h | 9 ++++-
src/include/access/nbtree_specialize.h | 56 ++++++++++++++++++++++++++
2 files changed, 64 insertions(+), 1 deletion(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 489b623663..1559399b0e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1120,6 +1120,7 @@ typedef struct BTOptions
/*
* Macros used in the nbtree specialization code.
*/
+#define NBTS_TYPE_SINGLE_COLUMN single
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
@@ -1151,7 +1152,13 @@ do { \
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
- NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
)
#else /* not defined NBTS_ENABLED */
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index 23fdda4f0e..642bc4c795 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -79,6 +79,62 @@
#define nbts_call_norel(name, rel, ...) \
(NBTS_FUNCTION(name)(__VA_ARGS__))
+/*
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path may never be used for indexes with multiple key
+ * columns, because it does not ever continue to a next column.
+ */
+
+#define NBTS_SPECIALIZING_SINGLE_COLUMN
+#define NBTS_TYPE NBTS_TYPE_SINGLE_COLUMN
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert((endAttNum) <= 1); \
+ for (int spec_i = 0; spec_i == 0 && (initAttNum) == 1 && (endAttNum) == 1; spec_i++)
+
+#define nbts_attiter_attnum 1
+
+/*
+ * Simplified (optimized) variant of index_getattr specialized for extracting
+ * only the first attribute: cache offset is guaranteed to be 0, and as such
+ * no cache is required.
+ */
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+( \
+ AssertMacro(spec_i == 0), \
+ (IndexTupleHasNulls(itup) && att_isnull(0, (char *)(itup) + sizeof(IndexTupleData))) ? \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull)) = true, \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull) = false), \
+ (Datum) fetchatt(TupleDescAttr((tupDesc), 0), \
+ (char *) (itup) + IndexInfoFindDataOffset((itup)->t_info)) \
+ ) \
+)
+
+#define nbts_attiter_curattisnull(tuple) \
+ NBTS_MAKE_NAME(tuple, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_COLUMN
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* Multiple key columns, optimized access for attcacheoff -cacheable offsets.
*/
--
2.30.2
v2-0001-Specialize-nbtree-functions-on-btree-key-shape.patch (application/x-patch)
From 1ff81e78ae0cecbe2d46735b26e0fad049f45772 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 30 Jan 2022 16:23:31 +0100
Subject: [PATCH v2 1/7] Specialize nbtree functions on btree key shape
nbtree keys are not all made the same, so a significant amount of time is
spent in code that exists only to deal with other key shapes. By specializing
function calls based on the key shape, we can remove or reduce these sources
of overhead.
This commit adds the basic infrastructure for specializing specific hot code
in the nbtree AM to certain shapes of keys, initially splitting out
(not yet: specializing) the attcacheoff-capable case.
Note that we generate N specialized functions and 1 'default' function for each
specializable function.
This feature can be disabled by removing the '#define NBTS_ENABLED' line in nbtree.h.
---
src/backend/access/nbtree/README | 22 +
src/backend/access/nbtree/nbtdedup.c | 300 +------
src/backend/access/nbtree/nbtdedup_spec.h | 313 +++++++
src/backend/access/nbtree/nbtinsert.c | 572 +-----------
src/backend/access/nbtree/nbtinsert_spec.h | 569 ++++++++++++
src/backend/access/nbtree/nbtpage.c | 4 +-
src/backend/access/nbtree/nbtree.c | 31 +-
src/backend/access/nbtree/nbtree_spec.h | 50 ++
src/backend/access/nbtree/nbtsearch.c | 994 +--------------------
src/backend/access/nbtree/nbtsearch_spec.h | 994 +++++++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 271 +-----
src/backend/access/nbtree/nbtsort_spec.h | 275 ++++++
src/backend/access/nbtree/nbtsplitloc.c | 14 +-
src/backend/access/nbtree/nbtutils.c | 755 +---------------
src/backend/access/nbtree/nbtutils_spec.h | 772 ++++++++++++++++
src/backend/utils/sort/tuplesort.c | 4 +-
src/include/access/nbtree.h | 61 +-
src/include/access/nbtree_specialize.h | 204 +++++
src/include/access/nbtree_specialized.h | 67 ++
19 files changed, 3357 insertions(+), 2915 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.h
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.h
create mode 100644 src/backend/access/nbtree/nbtree_spec.h
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.h
create mode 100644 src/backend/access/nbtree/nbtsort_spec.h
create mode 100644 src/backend/access/nbtree/nbtutils_spec.h
create mode 100644 src/include/access/nbtree_specialize.h
create mode 100644 src/include/access/nbtree_specialized.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 5529afc1fe..3c08888c23 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1041,6 +1041,28 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+
+Notes about nbtree call specialization
+--------------------------------------
+
+Attribute iteration is a significant overhead for multi-column indexes.
+We can avoid it by specializing performance-sensitive search functions
+and calling those selectively. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths. This performance benefit comes at the cost
+of binary size, so the feature can be disabled by undefining NBTS_ENABLED.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - single-column indexes
+ NB: The code paths of this optimization do not support multiple key columns.
+ - multi-column indexes that could benefit from the attcacheoff optimization
+    NB: This is also used for the default case, and is slow for uncacheable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 0207421a5d..d7025d8e1c 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,259 +22,16 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple, and actually update the page. Else
- * reset the state and move on without modifying the page.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
/*
* Perform bottom-up index deletion pass.
@@ -373,7 +130,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
- else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ else if (nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
@@ -748,55 +505,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
diff --git a/src/backend/access/nbtree/nbtdedup_spec.h b/src/backend/access/nbtree/nbtdedup_spec.h
new file mode 100644
index 0000000000..27e5a7e686
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.h
@@ -0,0 +1,313 @@
+/*
+ * Specialized functions included in nbtdedup.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static bool NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
+
+#endif /* ifndef NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = nbts_call(_bt_do_singleval, rel, page, state,
+ minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and actually update the page. Else
+ * reset the state and move on without modifying the page.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f6f4af8bfe..ec6c73d1cc 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,18 +30,13 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
-static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_stepright(Relation rel,
+ BTInsertState insertstate,
+ BTStack stack);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -73,311 +68,10 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
- NULL);
-}
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -438,7 +132,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = nbts_call(_bt_binsrch_insert, rel, insertstate);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -483,7 +177,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(nbts_call(_bt_compare, rel, itup_key, page, offset) < 0);
break;
}
@@ -508,7 +202,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (nbts_call(_bt_compare, rel, itup_key, page, offset) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -722,7 +416,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -769,246 +463,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
-
- newitemoff = _bt_binsrch_insert(rel, insertstate);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1649,7 +1103,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lastleft = nposting;
}
- lefthighkey = _bt_truncate(rel, lastleft, firstright, itup_key);
+ lefthighkey = nbts_call(_bt_truncate, rel, lastleft, firstright, itup_key);
itemsz = IndexTupleSize(lefthighkey);
}
else
@@ -2764,8 +2218,8 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
/* Perform deduplication pass (when enabled and index-is-allequalimage) */
if (BTGetDeduplicateItems(rel) && itup_key->allequalimage)
- _bt_dedup_pass(rel, buffer, heapRel, insertstate->itup,
- insertstate->itemsz, (indexUnchanged || uniquedup));
+ nbts_call(_bt_dedup_pass, rel, buffer, heapRel, insertstate->itup,
+ insertstate->itemsz, (indexUnchanged || uniquedup));
}
/*
diff --git a/src/backend/access/nbtree/nbtinsert_spec.h b/src/backend/access/nbtree/nbtinsert_spec.h
new file mode 100644
index 0000000000..97c866aea3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.h
@@ -0,0 +1,569 @@
+/*
+ * Specialized functions for nbtinsert.c
+ */
+
+/*
+ * These functions are static and not exposed outside this file, so their
+ * "default" emitted form would never be called and would only produce
+ * unused-function warnings. Avoid that dead code and those warnings by not
+ * emitting these functions when generating the default variant.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static BTStack NBTS_FUNCTION(_bt_search_insert)(Relation rel,
+ BTInsertState insertstate);
+
+static OffsetNumber NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ nbts_call(_bt_compare, rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return nbts_call(_bt_search, rel, insertstate->itup_key,
+ &insertstate->buf, BT_WRITE, NULL);
+}
+
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0)
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ Assert(P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0);
+
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
+
+#endif /* ifndef NBTS_SPECIALIZING_DEFAULT */
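
A minimal, self-contained sketch of the repeated-inclusion trick that the
NBTS_SPECIALIZING_DEFAULT guard above depends on. nbtree_specialize.h is not
shown in this excerpt, so everything below other than NBTS_FUNCTION and
NBTS_SPECIALIZING_DEFAULT is a made-up stand-in, and the NBT_SPECIALIZE_FILE
indirection is omitted for brevity:

/* my_template.h -- stands in for one of the NBT_SPECIALIZE_FILE headers */

#ifndef NBTS_SPECIALIZING_DEFAULT
/*
 * File-local helper, analogous to _bt_search_insert() above: only the
 * specialized copies are ever called, so no default copy is emitted.
 */
static int
NBTS_FUNCTION(add_one)(int x)
{
    return x + 1;
}
#endif

/* Exposed function, analogous to _bt_doinsert(): emitted for every variant */
int
NBTS_FUNCTION(times_two)(int x)
{
#ifdef NBTS_SPECIALIZING_DEFAULT
    return x * 2;        /* the real default variants dispatch via nbts_call() */
#else
    return NBTS_FUNCTION(add_one)(x) + x - 1;
#endif
}

/* specializer.h -- stands in for nbtree_specialize.h */
#define NBTS_MAKE_NAME_(name, suffix)   name##_##suffix
#define NBTS_MAKE_NAME(name, suffix)    NBTS_MAKE_NAME_(name, suffix)
#define NBTS_FUNCTION(name)             NBTS_MAKE_NAME(name, NBTS_SUFFIX)

/* specialized variant: emits add_one_single_key() and times_two_single_key() */
#define NBTS_SUFFIX single_key
#include "my_template.h"
#undef NBTS_SUFFIX

/* default variant: emits only times_two_default_shape(), thanks to the guard */
#define NBTS_SPECIALIZING_DEFAULT
#define NBTS_SUFFIX default_shape
#include "my_template.h"
#undef NBTS_SUFFIX
#undef NBTS_SPECIALIZING_DEFAULT
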
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = nbts_call(_bt_mkscankey, rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = nbts_call(_bt_search_insert, rel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = nbts_call(_bt_findinsertloc, rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+ itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
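
The nbts_call()/nbts_call_norel() invocations used throughout the file above
have to pick one of the copies that nbtree_specialize.h emits. Since that
header is not part of this excerpt, the dispatcher below is only a guess at
its shape, reusing the made-up NBTS_MAKE_NAME and suffix names from the
earlier sketch; the real macro presumably distinguishes more key shapes than
"single key attribute" versus "everything else":

#include "utils/rel.h"    /* IndexRelationGetNumberOfKeyAttributes() */

/* Hypothetical nbts_call_norel(): pick a variant from rel's key shape only */
#define nbts_call_norel(name, rel, ...) \
    ( \
        IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? \
            NBTS_MAKE_NAME(name, single_key)(__VA_ARGS__) : \
            NBTS_MAKE_NAME(name, default_shape)(__VA_ARGS__) \
    )

/* Hypothetical nbts_call(): same, but rel is also the first call argument */
#define nbts_call(name, rel, ...) \
    nbts_call_norel(name, (rel), (rel), __VA_ARGS__)

/*
 * Under these definitions, nbts_call(_bt_binsrch_insert, rel, insertstate)
 * expands to either _bt_binsrch_insert_single_key(rel, insertstate) or
 * _bt_binsrch_insert_default_shape(rel, insertstate).
 */
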
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 20adb602a4..e66299ebd8 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1967,10 +1967,10 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, targetkey);
+ itup_key = nbts_call(_bt_mkscankey, rel, targetkey);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
- stack = _bt_search(rel, itup_key, &sleafbuf, BT_READ, NULL);
+ stack = nbts_call(_bt_search, rel, itup_key, &sleafbuf, BT_READ, NULL);
/* won't need a second lock or pin on leafbuf */
_bt_relbuf(rel, sleafbuf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 06131f23d4..09c43eb226 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,10 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -178,33 +182,6 @@ btbuildempty(Relation index)
smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
diff --git a/src/backend/access/nbtree/nbtree_spec.h b/src/backend/access/nbtree/nbtree_spec.h
new file mode 100644
index 0000000000..4c342287f6
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.h
@@ -0,0 +1,50 @@
+/*
+ * Specialized functions for nbtree.c
+ */
+
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
+void
+NBTS_FUNCTION(_bt_specialize)(Relation rel)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbts_call_norel(_bt_specialize, rel, rel);
+#else
+ rel->rd_indam->aminsert = NBTS_FUNCTION(btinsert);
+#endif
+}
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbts_call_norel(_bt_specialize, rel, rel);
+
+ return nbts_call(btinsert, rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexUnchanged, indexInfo);
+#else
+ bool result;
+ IndexTuple itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = nbts_call(_bt_doinsert, rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+#endif
+}
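
Taken together, _bt_specialize() and the two btinsert() variants above mean
that the first insert going through the generic rd_indam->aminsert pointer
swaps that pointer for the variant matching the index, so later inserts skip
the shape dispatch entirely. A hypothetical caller's-eye illustration (the
argument values and the btinsert_single_key name are placeholders, not taken
from the patch):

#include "postgres.h"
#include "access/amapi.h"        /* IndexAmRoutine */
#include "access/genam.h"        /* IndexUniqueCheck */
#include "nodes/execnodes.h"     /* IndexInfo */
#include "storage/itemptr.h"     /* ItemPointer */
#include "utils/rel.h"           /* RelationData.rd_indam */

void
example_insert_twice(Relation rel, Datum *values, bool *isnull,
                     ItemPointer ht_ctid, Relation heapRel,
                     IndexInfo *indexInfo)
{
    /* First call lands in the default btinsert(), which specializes rel... */
    rel->rd_indam->aminsert(rel, values, isnull, ht_ctid, heapRel,
                            UNIQUE_CHECK_NO, false, indexInfo);

    /* ...so this call goes straight to e.g. btinsert_single_key(). */
    rel->rd_indam->aminsert(rel, values, isnull, ht_ctid, heapRel,
                            UNIQUE_CHECK_NO, false, indexInfo);
}
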
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c74543bfde..e81eee9c35 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,11 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -46,6 +43,9 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* _bt_drop_lock_and_maybe_pin()
@@ -70,493 +70,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- */
-BTStack
-_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
- Snapshot snapshot)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- */
-Buffer
-_bt_moveright(Relation rel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- Snapshot snapshot)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- page = BufferGetPage(buf);
- TestForOldSnapshot(snapshot, rel, page);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- break;
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
- {
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- break;
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- high = mid;
- }
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- {
- high = mid;
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
/*----------
* _bt_binsrch_posting() -- posting list binary search.
@@ -625,217 +138,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- return result;
-
- scankey++;
- }
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
-
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -1363,7 +665,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
- stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
+ stack = nbts_call(_bt_search, rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
@@ -1392,7 +694,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = nbts_call(_bt_binsrch, rel, &inskey, buf);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1422,9 +724,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, offnum))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, offnum))
{
/*
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
@@ -1498,280 +800,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2014,7 +1042,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation,
+ scan, dir, P_FIRSTDATAKEY(opaque)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2116,7 +1145,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation, scan,
+ dir, PageGetMaxOffsetNumber(page)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2448,7 +1478,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, start))
{
/*
* There's no actually-matching data on this page. Try to advance to
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
new file mode 100644
index 0000000000..73d5370496
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -0,0 +1,994 @@
+/*
+ * Specialized functions for nbtsearch.c
+ */
+
+/*
+ * These functions are static and not exposed outside this file, so the
+ * "default" (unspecialized) copies emitted for them would be unused and
+ * would trigger compiler warnings.  Avoid generating that dead code and the
+ * resulting warnings by not emitting these functions when building the
+ * default variants.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel, BTScanInsert key,
+ Buffer buf);
+static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_binsrch)(Relation rel,
+ BTScanInsert key,
+ Buffer buf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
+
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* allow next page to be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it is
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = nbts_call(_bt_checkkeys, scan->indexRelation,
+ scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+#endif /* NBTS_SPECIALIZING_DEFAULT */
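+
+/*
+ * The functions below (_bt_search, _bt_moveright, _bt_binsrch_insert and
+ * _bt_compare) are not static, so unlike the helpers above they are emitted
+ * in the unspecialized "default" form as well.
+ */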
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ */
+BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
+ int access, Snapshot snapshot)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP,
+ (access == BT_WRITE), stack_in,
+ page_access, snapshot);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = nbts_call(_bt_binsrch, rel, key, *bufP);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is lowest non-leaf page level prior to leaves. So, if
+ * we're on the level 1 and asked to lock leaf page in write mode,
+ * then lock next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP, true, stack_in,
+ BT_WRITE, snapshot);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split
+ * is completed. 'stack' is only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ */
+Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ Snapshot snapshot)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ break;
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY) >= cmpval)
+ {
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ break;
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
+
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ {
+ high = mid;
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
+
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
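+ *
+ * XXX (reviewer assumption): the per-attribute loop below still fetches
+ * each key attribute with index_getattr(); the shape-specialized copies of
+ * this function are presumably where a different attribute-access strategy
+ * gets substituted for each key shape.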
+ *----------
+ */
+int32
+NBTS_FUNCTION(_bt_compare)(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+ scankey = key->scankeys;
+ for (int i = 1; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ return result;
+
+ scankey++;
+ }
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 9f60fa9894..f1d146ba71 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,9 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
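+
+/*
+ * Emit the key-shape-specialized variants of the functions defined in
+ * nbtsort_spec.h.  XXX (reviewer assumption): access/nbtree_specialize.h is
+ * expected to define NBTS_FUNCTION/nbts_call/nbts_call_norel and to #include
+ * NBT_SPECIALIZE_FILE once per supported key shape, so each inclusion emits
+ * one shape-specific copy of those functions.
+ */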
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* btbuild() -- build a new btree index.
@@ -566,7 +567,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+ wstate.inskey = nbts_call(_bt_mkscankey, wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
@@ -578,7 +579,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
PROGRESS_BTREE_PHASE_LEAF_LOAD);
- _bt_load(&wstate, btspool, btspool2);
+ nbts_call_norel(_bt_load, wstate.index, &wstate, btspool, btspool2);
}
/*
@@ -978,8 +979,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
lastleft = (IndexTuple) PageGetItem(opage, ii);
Assert(IndexTupleSize(oitup) > last_truncextra);
- truncated = _bt_truncate(wstate->index, lastleft, oitup,
- wstate->inskey);
+ truncated = nbts_call(_bt_truncate, wstate->index, lastleft, oitup,
+ wstate->inskey);
if (!PageIndexTupleOverwrite(opage, P_HIKEY, (Item) truncated,
IndexTupleSize(truncated)))
elog(ERROR, "failed to add high key to the index page");
@@ -1176,264 +1177,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
new file mode 100644
index 0000000000..8f4a3602ca
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -0,0 +1,275 @@
+/*
+ * Specialized functions included in nbtsort.c
+ */
+
+/*
+ * These functions are not exposed outside this file, so their "default"
+ * (unspecialized) form would be unused and would trigger compiler warnings.
+ * Avoid generating that dead code (and the warnings) by not emitting these
+ * functions when the code for the default variant is generated.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static void NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2);
+
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (nbts_call(_bt_keep_natts_fast, wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
+
+#endif
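
To make the macro machinery above easier to review in isolation: a
stripped-down, compilable sketch of how the NBTS_FUNCTION naming scheme is
expected to expand. The macro bodies below are simplified stand-ins for
illustration only, not the definitions from access/nbtree_specialize.h
(which appears earlier in the series).

#include <stdio.h>

/* extra expansion step so that NBTS_SUFFIX itself is macro-expanded */
#define NBTS_PASTE(a, b)     a##_##b
#define NBTS_MAKE_NAME(a, b) NBTS_PASTE(a, b)
#define NBTS_FUNCTION(name)  NBTS_MAKE_NAME(name, NBTS_SUFFIX)

/* "specializable" code, emitted once per key shape */
#define NBTS_SUFFIX single
static const char *NBTS_FUNCTION(shape_name)(void) { return "single key column"; }
#undef NBTS_SUFFIX

#define NBTS_SUFFIX uncached
static const char *NBTS_FUNCTION(shape_name)(void) { return "uncacheable offsets"; }
#undef NBTS_SUFFIX

int
main(void)
{
	/* the real nbts_call() picks one of these based on the index relation */
	printf("%s\n%s\n", shape_name_single(), shape_name_uncached());
	return 0;
}
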
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 241e26d338..8e5337cad7 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -692,7 +692,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
{
itemid = PageGetItemId(state->origpage, maxoff);
tup = (IndexTuple) PageGetItem(state->origpage, itemid);
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -723,7 +723,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -972,7 +972,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* avoid appending a heap TID in new high key, we're done. Finish split
* with default strategy and initial split interval.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
return perfectpenalty;
@@ -993,7 +993,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* If page is entirely full of duplicates, a single value strategy split
* will be performed.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
{
*strategy = SPLIT_MANY_DUPLICATES;
@@ -1031,8 +1031,8 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
itemid = PageGetItemId(state->origpage, P_HIKEY);
hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
- perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
- state->newitem);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, hikey,
+ state->newitem);
if (perfectpenalty <= indnkeyatts)
*strategy = SPLIT_SINGLE_VALUE;
else
@@ -1154,7 +1154,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
lastleft = _bt_split_lastleft(state, split);
firstright = _bt_split_firstright(state, split);
- return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+ return nbts_call(_bt_keep_natts_fast, state->rel, lastleft, firstright);
}
/*
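
The nbtsplitloc.c hunks above are purely mechanical: each direct call to
_bt_keep_natts_fast() is routed through nbts_call(), which selects the variant
generated for the index's key shape. Conceptually the dispatch amounts to a
switch on that shape, roughly as in the simplified sketch below; the enum and
helper names here are invented for illustration and do not appear in the patch.

#include <stdio.h>

typedef enum
{
	KEY_SHAPE_SINGLE,		/* exactly one key attribute */
	KEY_SHAPE_CACHED,		/* attcacheoff usable for all key attributes */
	KEY_SHAPE_UNCACHED		/* offsets not cacheable; use the iterator */
} KeyShape;

static int keep_natts_fast_single(void)   { return 1; }
static int keep_natts_fast_cached(void)   { return 2; }
static int keep_natts_fast_uncached(void) { return 3; }

/* roughly what nbts_call(_bt_keep_natts_fast, rel, ...) boils down to */
static int
keep_natts_fast_dispatch(KeyShape shape)
{
	switch (shape)
	{
		case KEY_SHAPE_SINGLE:
			return keep_natts_fast_single();
		case KEY_SHAPE_CACHED:
			return keep_natts_fast_cached();
		default:
			return keep_natts_fast_uncached();
	}
}

int
main(void)
{
	printf("%d\n", keep_natts_fast_dispatch(KEY_SHAPE_UNCACHED));
	return 0;
}
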
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index ff260c393a..bc443ebd27 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,11 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1221,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2173,286 +1704,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
new file mode 100644
index 0000000000..a4b934ae7a
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -0,0 +1,772 @@
+/*
+ * Specialized functions included in nbtutils.c
+ */
+
+/*
+ * These functions are not exposed outside this file, so their "default"
+ * (unspecialized) form would be unused and would trigger compiler warnings.
+ * Avoid generating that dead code (and the warnings) by not emitting these
+ * functions when the code for the default variant is generated.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static bool NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+
+static int NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey, IndexTuple tuple,
+ int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == nbts_call(_bt_keep_natts_fast, rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+#endif /* NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (nbts_call_norel(_bt_check_rowcompare, rel, key, tuple,
+ tupnatts, tupdesc, dir, continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = nbts_call(_bt_keep_natts, rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
+ IndexTuple lastleft,
+ IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 1174e1a31c..816165217e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1122,7 +1122,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = nbts_call(_bt_mkscankey, indexRel, NULL);
if (state->indexInfo->ii_Expressions != NULL)
{
@@ -1220,7 +1220,7 @@ tuplesort_begin_index_btree(Relation heapRel,
state->enforceUnique = enforceUnique;
state->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = nbts_call(_bt_mkscankey, indexRel, NULL);
/* Prepare SortSupport data for each column */
state->sortKeys = (SortSupport) palloc0(state->nKeys *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 93f8267b48..83e0dbab16 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1116,15 +1116,47 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+
+
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
+
+#define NBTS_ENABLED
+
+#ifdef NBTS_ENABLED
+
+/*
+ * Access a specialized nbtree function, based on the shape of the index key.
+ */
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) \
+( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+)
+
+#else /* not defined NBTS_ENABLED */
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
+
+#endif /* NBTS_ENABLED */
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specialized.h"
+#include "nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1155,9 +1187,6 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
- IndexTuple newitem, Size newitemsz,
- bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1173,9 +1202,6 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
@@ -1223,12 +1249,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
- int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -1237,7 +1257,6 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1245,8 +1264,6 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1259,10 +1276,6 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
new file mode 100644
index 0000000000..23fdda4f0e
--- /dev/null
+++ b/src/include/access/nbtree_specialize.h
@@ -0,0 +1,204 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_specialize.h
+ * header file for postgres btree access method implementation.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_specialize.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_call(function, indexrel, ...rest_of_args), and
+ * nbts_call_norel(function, indexrel, ...args)
+ * This will call the specialized variant of 'function' based on the index
+ * relation data.
+ * The difference between nbts_call and nbts_call_norel is that nbts_call
+ * uses indexrel as the first argument in the function call, whereas
+ * nbts_call_norel does not.
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ *
+ * example usage:
+ *
+ * kwithnulls = nbts_call_norel(_bt_key_hasnulls, myindex, mytuple, tupDesc);
+ *
+ * NBTS_FUNCTION(_bt_key_hasnulls)(IndexTuple mytuple, TupleDesc tupDesc)
+ * {
+ * nbts_attiterdeclare(mytuple);
+ * nbts_attiterinit(mytuple, 1, tupDesc);
+ * nbts_foreachattr(1, 10)
+ * {
+ * Datum it = nbts_attiter_nextattdatum(mytuple, tupDesc);
+ * if (nbts_attiter_curattisnull(mytuple))
+ * return true;
+ * }
+ * return false;
+ * }
+ */
+
+/*
+ * Call a potentially specialized function for a given btree operation.
+ *
+ * NB: the rel argument is evaluated multiple times.
+ */
+#define nbts_call(name, rel, ...) \
+ nbts_call_norel(name, (rel), (rel), __VA_ARGS__)
+
+#ifdef NBTS_ENABLED
+
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+#ifdef nbts_call_norel
+#undef nbts_call_norel
+#endif
+
+#define nbts_call_norel(name, rel, ...) \
+ (NBTS_FUNCTION(name)(__VA_ARGS__))
+
+/*
+ * Multiple key columns, optimized access for attcacheoff-cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_CACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* reset call to SPECIALIZE_CALL for default behaviour */
+#undef nbts_call_norel
+#define nbts_call_norel(name, rel, ...) \
+ NBT_SPECIALIZE_CALL(name, (rel), __VA_ARGS__)
+
+/*
+ * "Default", externally accessible, not so much optimized functions
+ */
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+/* for the default functions, we want to use the unspecialized name. */
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) name
+
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* from here on there are no more NBTS_FUNCTIONs */
+#undef NBTS_FUNCTION
+
+#else /* not defined NBTS_ENABLED */
+
+/*
+ * NBTS_ENABLED is not defined, so we don't want to use the specializations.
+ * We revert to the behaviour from PG14 and earlier, which only uses
+ * attcacheoff.
+ */
+
+#define NBTS_FUNCTION(name) name
+
+#define nbts_call_norel(name, rel, ...) \
+ name(__VA_ARGS__)
+
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+
+#endif /* !NBTS_ENABLED */
diff --git a/src/include/access/nbtree_specialized.h b/src/include/access/nbtree_specialized.h
new file mode 100644
index 0000000000..c45fa84aed
--- /dev/null
+++ b/src/include/access/nbtree_specialized.h
@@ -0,0 +1,67 @@
+/*
+ * prototypes for functions that are included in nbtree.h
+ */
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_specialize)(Relation rel);
+
+extern bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key,
+ Buffer *bufP, int access,
+ Snapshot snapshot);
+extern Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel, BTScanInsert key, Buffer buf,
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot);
+extern OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate);
+extern int32
+NBTS_FUNCTION(_bt_compare)(Relation rel, BTScanInsert key,
+ Page page, OffsetNumber offnum);
+
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup);
+extern bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
--
2.30.2
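To make the macro machinery in the patch above easier to follow, here is a
minimal, self-contained sketch of the same technique (not part of the
patchset; every name in it is made up): a shared function body is compiled
once per key "shape" by re-including a template under different accessor
macros, and a dispatch macro picks the variant at runtime, analogous to
NBTS_FUNCTION, NBT_SPECIALIZE_FILE and NBT_SPECIALIZE_CALL.

/* specialize_demo.c -- toy model of the template-include approach used by
 * nbtree_specialize.h.  The function body after the #else is compiled once
 * per "shape" by re-including this file with different accessor macros;
 * the sum_attrs() macro then dispatches to the right variant at runtime. */
#ifndef SHAPE_TEMPLATE
#define SHAPE_TEMPLATE

#include <stdio.h>

#define MAKE_NAME_(a, b)    a##_##b
#define MAKE_NAME(a, b)     MAKE_NAME_(a, b)

/* shape 1: "cached" access, plain array indexing */
#define SHAPE               cached
#define GET_ATTR(vals, i)   ((vals)[i])
#include "specialize_demo.c"    /* emits sum_attrs_cached() */
#undef SHAPE
#undef GET_ATTR

/* shape 2: "uncached" access through a (pretend) more expensive helper */
static int uncached_get(const int *vals, int i) { return vals[i]; }
#define SHAPE               uncached
#define GET_ATTR(vals, i)   (uncached_get((vals), (i)))
#include "specialize_demo.c"    /* emits sum_attrs_uncached() */
#undef SHAPE
#undef GET_ATTR

/* runtime dispatch, analogous to NBT_SPECIALIZE_CALL / nbts_call */
#define sum_attrs(use_cached, vals, n) \
    ((use_cached) ? sum_attrs_cached((vals), (n)) : sum_attrs_uncached((vals), (n)))

int
main(void)
{
    int     vals[3] = {1, 2, 3};

    printf("%d %d\n", sum_attrs(1, vals, 3), sum_attrs(0, vals, 3));
    return 0;
}

#else                           /* SHAPE_TEMPLATE: the shared template body */

static int
MAKE_NAME(sum_attrs, SHAPE)(const int *vals, int natts)
{
    int     sum = 0;

    for (int i = 0; i < natts; i++)
        sum += GET_ATTR(vals, i);
    return sum;
}

#endif                          /* SHAPE_TEMPLATE */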
v2-0006-Implement-specialized-uncacheable-attribute-itera.patchapplication/x-patch; name=v2-0006-Implement-specialized-uncacheable-attribute-itera.patchDownload
From b694db3e884b06b7f6e4508f41ee2a2288f73c20 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:44:01 +0200
Subject: [PATCH v2 6/7] Implement specialized uncacheable attribute iteration
Uses an iterator to prevent doing duplicate work while iterating over
attributes.
Inspiration: https://www.postgresql.org/message-id/CAEze2WjE9ka8i%3Ds-Vv5oShro9xTrt5VQnQvFG9AaRwWpMm3-fg%40mail.gmail.com
---
src/include/access/itup.h | 179 +++++++++++++++++++++++++
src/include/access/nbtree.h | 12 +-
src/include/access/nbtree_specialize.h | 34 +++++
3 files changed, 223 insertions(+), 2 deletions(-)
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 2c8877e991..cc29614107 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -59,6 +59,15 @@ typedef struct IndexAttributeBitMapData
typedef IndexAttributeBitMapData * IndexAttributeBitMap;
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
/*
* t_info manipulation macros
*/
@@ -126,6 +135,42 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
) \
)
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupdesc - the tuple description
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* Offset of attribute 1 is always 0 */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ false, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false)
+
/*
* MaxIndexTuplesPerPage is an upper bound on the number of tuples that can
* fit on one index page. An index tuple must have either data or a null
@@ -161,4 +206,138 @@ extern IndexTuple CopyIndexTuple(IndexTuple source);
extern IndexTuple index_truncate_tuple(TupleDesc sourceDescriptor,
IndexTuple source, int leavenatts);
+/*
+ * Initialize an index attribute iterator so that its state covers
+ * attributes up to and including attnum.
+ *
+ * This is nearly the same as index_deform_tuple, except that this
+ * stores the internal state in the iterator, instead of populating the
+ * datum- and isnull-arrays.
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+/* ----------------
+ * index_attiternext() - get the next attribute of an index tuple
+ *
+ * This gets called many times, so we do the least amount of work
+ * possible.
+ *
+ * The code does not attempt to update attcacheoff, as it is unlikely
+ * to reach a situation where the cached offset matters a lot.
+ * If the cached offsets do matter, the caller should make sure that
+ * PopulateTupleDescCacheOffsets() was called on the tuple descriptor
+ * to populate the attribute offset cache.
+ *
+ * ----------------
+ */
+static inline Datum
+index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true;
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
#endif /* ITUP_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 1559399b0e..80f2575884 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1122,6 +1122,7 @@ typedef struct BTOptions
*/
#define NBTS_TYPE_SINGLE_COLUMN single
#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_DEFAULT default
@@ -1152,12 +1153,19 @@ do { \
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
- IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
) \
: \
( \
- NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ TupleDescAttr(RelationGetDescr(rel), \
+ IndexRelationGetNumberOfKeyAttributes(rel) - 1)->attcacheoff > 0 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_UNCACHED)(__VA_ARGS__) \
+ ) \
) \
)
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index 642bc4c795..52739d390e 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -168,6 +168,40 @@
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but the attcacheoff optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit((itup), (initAttNum), (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/* reset call to SPECIALIZE_CALL for default behaviour */
#undef nbts_call_norel
#define nbts_call_norel(name, rel, ...) \
--
2.30.2
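For readers unfamiliar with the iterator API added in 0006: the point is to
carry offset/"slow" state across attributes instead of letting index_getattr()
rescan from the first attribute whenever attcacheoff is unusable. Below is a
minimal usage sketch that only relies on the definitions from this patch; the
function count_null_keys itself is hypothetical and exists purely for
illustration.

#include "postgres.h"
#include "access/itup.h"

/*
 * Illustration only: walk the first nkeyatts attributes of itup exactly
 * once with the attribute iterator, instead of calling index_getattr()
 * per attribute.
 */
static int
count_null_keys(IndexTuple itup, TupleDesc tupdesc, int nkeyatts)
{
    IAttrIterStateData iter;
    int         nulls = 0;

    index_attiterinit(itup, 1, tupdesc, &iter);

    for (AttrNumber attnum = 1; attnum <= nkeyatts; attnum++)
    {
        Datum       datum = index_attiternext(itup, attnum, tupdesc, &iter);

        if (iter.isNull)
            nulls++;
        else
            (void) datum;       /* a real caller would compare or copy it */
    }

    return nulls;
}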
v2-0007-Implement-dynamic-prefix-compression-in-nbtree.patchapplication/x-patch; name=v2-0007-Implement-dynamic-prefix-compression-in-nbtree.patchDownload
From 16a6f12b795cb50b3a3f1f246f7c964d53e807f7 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 15 Apr 2022 18:25:38 +0200
Subject: [PATCH v2 7/7] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, if the tuples on both sides of the
tuple being compared are equal to the scankey on some prefix of the key
attributes, then the tuple being compared must also equal the scankey on
that prefix.
We cannot propagate this information to _binsrch on lower pages, as the
downstream page may concurrently have split and/or have merged with its
deleted left neighbour (see [0]), which moves the keyspace of the linked page.
We can thus only trust the current state of the current page for this
optimization, which means we must re-establish this state each time we open
the page.
Although this limits the overall gain, it still allows for a nice performance
improvement in most cases where the initial columns have many duplicate
values and a compare function that is not cheap.
---
contrib/amcheck/verify_nbtree.c | 17 +++--
src/backend/access/nbtree/README | 25 ++++++++
src/backend/access/nbtree/nbtinsert.c | 14 ++--
src/backend/access/nbtree/nbtinsert_spec.h | 22 +++++--
src/backend/access/nbtree/nbtsearch.c | 2 +-
src/backend/access/nbtree/nbtsearch_spec.h | 75 +++++++++++++++++-----
src/include/access/nbtree_specialized.h | 8 ++-
7 files changed, 127 insertions(+), 36 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 70278c4f93..5753611546 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2673,6 +2673,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2682,13 +2683,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2750,6 +2751,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2760,7 +2762,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2812,10 +2814,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2835,10 +2838,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2873,13 +2877,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3c08888c23..13ac9ee2be 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,31 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because an nbtree has a sorted keyspace, once we have determined that some
+prefix of key columns in the tuples on both sides of the tuple being
+compared is equal to the scankey, the current tuple must also share that
+prefix with the scankey. This allows us to skip comparing those columns,
+potentially saving cycles.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so it is only useful at the page level: concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by an inner page to the right. If we re-used the high- and low-column
+prefixes across pages, we would not be able to detect a change of keyspace
+from e.g. (2,2) to (1,2), and could subsequently return invalid results.
+This race condition can only be prevented by re-establishing the
+prefix-equal columns for each page.
+
+The upside is that we already have a comparison result for the highest
+value on a page: a page's high key is compared to the scankey while we hold
+a pin on the page in the _bt_moveright procedure. The _bt_binsrch procedure
+uses this result as the prefix bound for the rightmost position, and each
+step in the binary search (that does not compare less than the insert key)
+further improves the equal-prefix bounds.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ec6c73d1cc..20e5f33f98 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -132,7 +132,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ offset = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -142,6 +142,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -177,7 +179,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(nbts_call(_bt_compare, rel, itup_key, page, offset) < 0);
+ Assert(nbts_call(_bt_compare, rel, itup_key, page, offset,
+ &cmpcol) < 0);
break;
}
@@ -202,7 +205,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (nbts_call(_bt_compare, rel, itup_key, page, offset) != 0)
+ if (nbts_call(_bt_compare, rel, itup_key, page, offset,
+ &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -412,11 +416,13 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY);
+ highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY,
+ &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
diff --git a/src/backend/access/nbtree/nbtinsert_spec.h b/src/backend/access/nbtree/nbtinsert_spec.h
index 97c866aea3..ccba0fa5ed 100644
--- a/src/backend/access/nbtree/nbtinsert_spec.h
+++ b/src/backend/access/nbtree/nbtinsert_spec.h
@@ -73,6 +73,7 @@ NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber comparecol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -91,7 +92,8 @@ NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- nbts_call(_bt_compare, rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ nbts_call(_bt_compare, rel, insertstate->itup_key, page,
+ P_HIKEY, &comparecol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -221,6 +223,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
/*
* Does the new tuple belong on this page?
*
@@ -238,7 +241,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0)
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, insertstate, stack);
@@ -291,6 +294,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -321,7 +325,8 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) != 0 ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY,
+ &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -336,10 +341,13 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) || nbts_call(_bt_compare, rel, itup_key,
+ page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -358,7 +366,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d5152bfcb7..036ce88679 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -696,7 +696,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = nbts_call(_bt_binsrch, rel, &inskey, buf);
+ offnum = nbts_call(_bt_binsrch, rel, &inskey, buf, 1);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
index a5c5f2b94f..829c216819 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.h
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -10,8 +10,10 @@
*/
#ifndef NBTS_SPECIALIZING_DEFAULT
-static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel, BTScanInsert key,
- Buffer buf);
+static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ AttrNumber highkeycmpcol);
static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
@@ -38,7 +40,8 @@ static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
static OffsetNumber
NBTS_FUNCTION(_bt_binsrch)(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -46,6 +49,8 @@ NBTS_FUNCTION(_bt_binsrch)(Relation rel,
high;
int32 result,
cmpval;
+ AttrNumber highcmpcol = highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -87,15 +92,22 @@ NBTS_FUNCTION(_bt_binsrch)(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = nbts_call(_bt_compare, rel, key, page, mid);
+ result = nbts_call(_bt_compare, rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
/*
@@ -441,6 +453,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
IndexTuple itup;
BlockNumber child;
BTStack new_stack;
+ AttrNumber highkeycmpcol = 1;
/*
* Race -- the page we just grabbed may have split since we read its
@@ -456,7 +469,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
*/
*bufP = nbts_call(_bt_moveright, rel, key, *bufP,
(access == BT_WRITE), stack_in,
- page_access, snapshot);
+ page_access, snapshot, &highkeycmpcol);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -468,7 +481,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = nbts_call(_bt_binsrch, rel, key, *bufP);
+ offnum = nbts_call(_bt_binsrch, rel, key, *bufP, highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
@@ -507,6 +520,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ AttrNumber highkeycmpcol = 1;
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -517,7 +531,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
* move right to its new sibling. Do that.
*/
*bufP = nbts_call(_bt_moveright, rel, key, *bufP, true, stack_in,
- BT_WRITE, snapshot);
+ BT_WRITE, snapshot, &highkeycmpcol);
}
return stack_in;
@@ -565,12 +579,15 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
bool forupdate,
BTStack stack,
int access,
- Snapshot snapshot)
+ Snapshot snapshot,
+ AttrNumber *comparecol)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
+ Assert(PointerIsValid(comparecol));
+
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -592,12 +609,17 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = cmpcol;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -623,14 +645,19 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY) >= cmpval)
+ if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY,
+ &cmpcol) >= cmpval)
{
/* step right one page */
+ *comparecol = 1;
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -663,7 +690,8 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
* list split).
*/
OffsetNumber
-NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -673,6 +701,7 @@ NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -723,16 +752,21 @@ NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = nbts_call(_bt_compare, rel, key, page, mid);
+ result = nbts_call(_bt_compare, rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
if (result != 0)
stricthigh = high;
}
@@ -813,7 +847,8 @@ int32
NBTS_FUNCTION(_bt_compare)(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -854,10 +889,11 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- nbts_attiterinit(itup, 1, itupdesc);
- nbts_foreachattr(1, ncmpkey)
+ nbts_attiterinit(itup, *comparecol, itupdesc);
+ scankey = key->scankeys + ((*comparecol) - 1);
+
+ nbts_foreachattr(*comparecol, ncmpkey)
{
Datum datum;
@@ -902,11 +938,20 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = nbts_attiter_attnum;
return result;
+ }
scankey++;
}
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
diff --git a/src/include/access/nbtree_specialized.h b/src/include/access/nbtree_specialized.h
index c45fa84aed..7402a4c46e 100644
--- a/src/include/access/nbtree_specialized.h
+++ b/src/include/access/nbtree_specialized.h
@@ -43,12 +43,14 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key,
extern Buffer
NBTS_FUNCTION(_bt_moveright)(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access,
- Snapshot snapshot);
+ Snapshot snapshot, AttrNumber *comparecol);
extern OffsetNumber
-NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate);
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
extern int32
NBTS_FUNCTION(_bt_compare)(Relation rel, BTScanInsert key,
- Page page, OffsetNumber offnum);
+ Page page, OffsetNumber offnum,
+ AttrNumber *comparecol);
/*
* prototypes for functions in nbtutils_spec.h
--
2.30.2
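The essence of 0007 is the pair of prefix bounds threaded through the binary
search: once the tuples bounding the search window are known to agree with the
scankey on their first k columns, every tuple in between must agree on those
columns as well, so the comparison can start at column k+1. The following is a
toy, self-contained model of just that loop, over plain integer keys rather
than IndexTuples; all names here are hypothetical and toy_compare() merely
stands in for the patched _bt_compare().

/*
 * Toy model of the bounded binary search in 0007, over "tuples" that each
 * have natts integer key columns, sorted ascending.  lowcol/highcol track
 * how many leading columns the current low/high bound tuples are already
 * known to share with the search key.
 */
#include <stdio.h>

typedef struct ToyTuple
{
    const int  *keys;           /* natts key columns */
} ToyTuple;

/* compare tup against key, starting at column *startcol (0-based);
 * on return, *startcol is the first column that differed (or natts) */
static int
toy_compare(const ToyTuple *tup, const int *key, int natts, int *startcol)
{
    for (int col = *startcol; col < natts; col++)
    {
        if (tup->keys[col] != key[col])
        {
            *startcol = col;
            return (key[col] > tup->keys[col]) ? 1 : -1;
        }
    }
    *startcol = natts;
    return 0;
}

/* find the first tuple >= key, reusing prefix knowledge between steps */
static int
toy_binsrch(const ToyTuple *tuples, int ntuples, const int *key, int natts)
{
    int         low = 0,
                high = ntuples;
    int         lowcol = 0,
                highcol = 0;

    while (high > low)
    {
        int         mid = low + (high - low) / 2;
        int         cmpcol = (lowcol < highcol) ? lowcol : highcol;
        int         result = toy_compare(&tuples[mid], key, natts, &cmpcol);

        if (result > 0)         /* key > mid tuple: move the low bound up */
        {
            low = mid + 1;
            lowcol = cmpcol;    /* new low bound shares cmpcol columns */
        }
        else                    /* key <= mid tuple: move the high bound down */
        {
            high = mid;
            highcol = cmpcol;   /* new high bound shares cmpcol columns */
        }
    }
    return low;
}

int
main(void)
{
    static const int k1[2] = {1, 10}, k2[2] = {1, 20}, k3[2] = {2, 5};
    const ToyTuple tuples[3] = {{k1}, {k2}, {k3}};
    const int   key[2] = {1, 20};

    /* prints 1: the first tuple >= (1,20) is at index 1 */
    printf("%d\n", toy_binsrch(tuples, 3, key, 2));
    return 0;
}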
On Sat, 16 Apr 2022 at 01:05, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Still to-do:
- Validate performance and share the numbers for the same test indexes
in [1]. I'm planning on doing that next Monday.
While working on benchmarking the v2 patchset, I noticed no
improvement on reindex, which I attributed to forgetting to also
specialize comparetup_index_btree in tuplesort.c. After adding the
specialization there as well (attached in v3), reindex performance
improved significantly too.
Performance results are attached in pgbench_log.[master,patched], which
include the summarized output. Notes on those results:
- single-column reindex seems to have the same performance between
patch and master, within 1% error margins.
- multi-column indexes with useful ->attcacheoff also see no obvious
performance degradation
- multi-column indexes with no useful ->attcacheoff see significant
insert performance improvement:
-8% runtime on 3 text attributes with default (C) collation ("ccl");
-9% runtime on 1 'en_US'-collated attribute + 2 text attributes
("ccl_collated");
-13% runtime on 1 null + 3 text attributes ("accl");
-74% runtime (!) on 31 'en_US'-collated 0-length text attributes + 1
uuid attribute ("worstcase" - I could not think of a worse index shape
than this one).
- reindex performance gains are much more visible: up to 84% (!) less
time spent for "worstcase", and 18-31% for the other multi-column
indexes mentioned above.
Other notes:
- The dataset I used is the same as in [1]: the pp-complete.csv
as was available on 2021-06-20, containing 26070307 rows.
- The performance was measured on 7 runs of the attached bench script,
using pgbench to measure statement times etc.
- Database was initialized with C locale, all tables are unlogged and
source table was VACUUM (FREEZE, ANALYZE)-ed before starting.
- (!) On HEAD @ 5bb2b6ab, INSERT is faster than REINDEX for the
"worstcase" index. I've not yet discovered why (only lightly looked
into it, no sort debugging), and considering that the issue does not
appear in similar quantities in the patched version, I'm not planning
on putting a lot of time into that.
- Per-transaction results for the run on master were accidentally
deleted, I don't consider them important enough to re-run the
benchmark.
- Decide whether / how to keep the NBTS_ENABLED flag. The current
#define in nbtree.h is a bad example of a compile-time configuration
that should be changed (even if we only want to be able to disable
specialization at compile time, it should be moved). Maybe:
- More tests: PG already extensively tests the btree code while it is
running the test suite - btree is the main index AM - but more tests
might be needed to test the validity of the specialized code.
No work on those points yet.
I'll add this to CF 2022-07 for tracking.
Kind regards,
Matthias van de Meent.
[1]: /messages/by-id/CAEze2WhyBT2bKZRdj_U0KS2Sbewa1XoO_BzgpzLC09sa5LUROg@mail.gmail.com
Attachments:
v3-0003-Specialize-the-nbtree-rd_indam-entry.patchapplication/x-patch; name=v3-0003-Specialize-the-nbtree-rd_indam-entry.patchDownload
From eb05201ec8f207aec3c106813424477b9ab3c454 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:54:52 +0200
Subject: [PATCH v3 3/9] Specialize the nbtree rd_indam entry.
Because each rd_indam struct is separately allocated for each index, we can
freely modify it at runtime without impacting other indexes of the same
access method. For btinsert (which effectively only calls _bt_insert) it is
useful to specialize that function, which also makes rd_indam->aminsert a
good signal for whether or not the indexRelation has been fully optimized yet.
---
src/backend/access/nbtree/nbtree.c | 7 +++++++
src/backend/access/nbtree/nbtsearch.c | 2 ++
src/backend/access/nbtree/nbtsort.c | 2 ++
src/include/access/nbtree.h | 14 ++++++++++++++
4 files changed, 25 insertions(+)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 09c43eb226..95da2c46bf 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,6 +161,8 @@ btbuildempty(Relation index)
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+ nbt_opt_specialize(index);
+
/*
* Write the page and log it. It might seem that an immediate sync would
* be sufficient to guarantee that the file exists on disk, but recovery
@@ -323,6 +325,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -765,6 +769,7 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(info->index);
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
@@ -798,6 +803,8 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (info->analyze_only)
return stats;
+ nbt_opt_specialize(info->index);
+
/*
* If btbulkdelete was called, we need not do anything (we just maintain
* the information used within _bt_vacuum_needs_cleanup() by calling
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e81eee9c35..d5152bfcb7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -181,6 +181,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(!BTScanPosIsValid(so->currPos));
+ nbt_opt_specialize(scan->indexRelation);
+
pgstat_count_index_scan(rel);
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f1d146ba71..22c7163197 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -305,6 +305,8 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
BTBuildState buildstate;
double reltuples;
+ nbt_opt_specialize(index);
+
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
ResetUsage();
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0dbab16..489b623663 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1132,6 +1132,19 @@ typedef struct BTOptions
#ifdef NBTS_ENABLED
+/*
+ * Replace the functions in the rd_indam struct with a variant optimized for
+ * our key shape, if not already done.
+ *
+ * It only needs to be done once for every index relation loaded, so it's
+ * quite unlikely that we actually need to do this, hence the unlikely().
+ */
+#define nbt_opt_specialize(rel) \
+do { \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert)) \
+ _bt_specialize(rel); \
+} while (false)
+
/*
* Access a specialized nbtree function, based on the shape of the index key.
*/
@@ -1143,6 +1156,7 @@ typedef struct BTOptions
#else /* not defined NBTS_ENABLED */
+#define nbt_opt_specialize(rel)
#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
#endif /* NBTS_ENABLED */
--
2.30.2
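The pattern in 0003 above is a lazily self-specializing callback table: the
generic aminsert entry doubles as the "not yet specialized" marker, and the
first call through it swaps in the shape-specific variant. A hypothetical,
stand-alone model of that pattern follows; none of these names exist in
PostgreSQL, and the real body of _bt_specialize() lives in the patch, not
here.

/* Toy model of the nbt_opt_specialize() idea: the generic callback marks
 * "not yet specialized", and the first call through it replaces itself. */
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyIndexRel ToyIndexRel;

typedef struct ToyIndexOps
{
    bool        (*aminsert) (ToyIndexRel *rel, const void *tuple);
} ToyIndexOps;

struct ToyIndexRel
{
    ToyIndexOps *ops;           /* separately allocated for each index */
    int         nkeyatts;
};

static bool
insert_single(ToyIndexRel *rel, const void *tuple)
{
    (void) rel;
    (void) tuple;
    return true;                /* would run the single-key-column fast path */
}

static bool
insert_multi(ToyIndexRel *rel, const void *tuple)
{
    (void) rel;
    (void) tuple;
    return true;                /* would run the multi-column path */
}

/* analogous to _bt_specialize(): pick the variant for this index's shape */
static void
specialize(ToyIndexRel *rel)
{
    rel->ops->aminsert = (rel->nkeyatts == 1) ? insert_single : insert_multi;
}

/* analogous to btinsert() guarded by nbt_opt_specialize() */
static bool
insert_generic(ToyIndexRel *rel, const void *tuple)
{
    if (rel->ops->aminsert == insert_generic)
        specialize(rel);        /* only happens once per loaded index */
    return rel->ops->aminsert(rel, tuple);
}

int
main(void)
{
    ToyIndexOps ops = {insert_generic};
    ToyIndexRel rel = {&ops, 1};

    insert_generic(&rel, NULL);     /* first call: specializes the table */
    rel.ops->aminsert(&rel, NULL);  /* now calls insert_single directly */
    return 0;
}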
v3-0001-Specialize-nbtree-functions-on-btree-key-shape.patchapplication/x-patch; name=v3-0001-Specialize-nbtree-functions-on-btree-key-shape.patchDownload
From 1ff81e78ae0cecbe2d46735b26e0fad049f45772 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 30 Jan 2022 16:23:31 +0100
Subject: [PATCH v3 1/9] Specialize nbtree functions on btree key shape
nbtree keys are not all made the same, so a significant amount of time is
spent on code that exists only to deal with other key shapes. By specializing
function calls based on the key shape, we can remove or reduce these causes
of overhead.
This commit adds the basic infrastructure for specializing specific hot code
in the nbtree AM to certain shapes of keys, initially splitting out
(not yet: specializing) the attcacheoff-capable case.
Note that we generate N specialized functions and 1 'default' function for each
specializable function.
This feature can be disabled by removing the #define NBTS_ENABLED line in nbtree.h.
---
src/backend/access/nbtree/README | 22 +
src/backend/access/nbtree/nbtdedup.c | 300 +------
src/backend/access/nbtree/nbtdedup_spec.h | 313 +++++++
src/backend/access/nbtree/nbtinsert.c | 572 +-----------
src/backend/access/nbtree/nbtinsert_spec.h | 569 ++++++++++++
src/backend/access/nbtree/nbtpage.c | 4 +-
src/backend/access/nbtree/nbtree.c | 31 +-
src/backend/access/nbtree/nbtree_spec.h | 50 ++
src/backend/access/nbtree/nbtsearch.c | 994 +--------------------
src/backend/access/nbtree/nbtsearch_spec.h | 994 +++++++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 271 +-----
src/backend/access/nbtree/nbtsort_spec.h | 275 ++++++
src/backend/access/nbtree/nbtsplitloc.c | 14 +-
src/backend/access/nbtree/nbtutils.c | 755 +---------------
src/backend/access/nbtree/nbtutils_spec.h | 772 ++++++++++++++++
src/backend/utils/sort/tuplesort.c | 4 +-
src/include/access/nbtree.h | 61 +-
src/include/access/nbtree_specialize.h | 204 +++++
src/include/access/nbtree_specialized.h | 67 ++
19 files changed, 3357 insertions(+), 2915 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.h
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.h
create mode 100644 src/backend/access/nbtree/nbtree_spec.h
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.h
create mode 100644 src/backend/access/nbtree/nbtsort_spec.h
create mode 100644 src/backend/access/nbtree/nbtutils_spec.h
create mode 100644 src/include/access/nbtree_specialize.h
create mode 100644 src/include/access/nbtree_specialized.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 5529afc1fe..3c08888c23 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1041,6 +1041,28 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+
+Notes about nbtree call specialization
+--------------------------------------
+
+Attribute iteration is a significant overhead for multi-column indexes.
+We can avoid it by specializing performance-sensitive search functions
+and calling those selectively. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths. This performance benefit is at the cost
+of binary size, so this feature can be disabled by defining NBTS_DISABLED.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - single-column indexes
+ NB: The code paths of this optimization do not support multiple key columns.
+ - multi-column indexes that could benefit from the attcacheoff optimization
+   NB: This is also used for the default case, and is slow for uncacheable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 0207421a5d..d7025d8e1c 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,259 +22,16 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple, and actually update the page. Else
- * reset the state and move on without modifying the page.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
/*
* Perform bottom-up index deletion pass.
@@ -373,7 +130,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
- else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ else if (nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
@@ -748,55 +505,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
diff --git a/src/backend/access/nbtree/nbtdedup_spec.h b/src/backend/access/nbtree/nbtdedup_spec.h
new file mode 100644
index 0000000000..27e5a7e686
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.h
@@ -0,0 +1,313 @@
+/*
+ * Specialized functions included in nbtdedup.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static bool NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
+
+#endif /* ifndef NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = nbts_call(_bt_do_singleval, rel, page, state,
+ minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and actually update the page. Else
+ * reset the state and move on without modifying the page.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f6f4af8bfe..ec6c73d1cc 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,18 +30,13 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
-static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_stepright(Relation rel,
+ BTInsertState insertstate,
+ BTStack stack);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -73,311 +68,10 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
- NULL);
-}
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -438,7 +132,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = nbts_call(_bt_binsrch_insert, rel, insertstate);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -483,7 +177,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(nbts_call(_bt_compare, rel, itup_key, page, offset) < 0);
break;
}
@@ -508,7 +202,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (nbts_call(_bt_compare, rel, itup_key, page, offset) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -722,7 +416,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -769,246 +463,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
-
- newitemoff = _bt_binsrch_insert(rel, insertstate);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1649,7 +1103,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lastleft = nposting;
}
- lefthighkey = _bt_truncate(rel, lastleft, firstright, itup_key);
+ lefthighkey = nbts_call(_bt_truncate, rel, lastleft, firstright, itup_key);
itemsz = IndexTupleSize(lefthighkey);
}
else
@@ -2764,8 +2218,8 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
/* Perform deduplication pass (when enabled and index-is-allequalimage) */
if (BTGetDeduplicateItems(rel) && itup_key->allequalimage)
- _bt_dedup_pass(rel, buffer, heapRel, insertstate->itup,
- insertstate->itemsz, (indexUnchanged || uniquedup));
+ nbts_call(_bt_dedup_pass, rel, buffer, heapRel, insertstate->itup,
+ insertstate->itemsz, (indexUnchanged || uniquedup));
}
/*
diff --git a/src/backend/access/nbtree/nbtinsert_spec.h b/src/backend/access/nbtree/nbtinsert_spec.h
new file mode 100644
index 0000000000..97c866aea3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.h
@@ -0,0 +1,569 @@
+/*
+ * Specialized functions for nbtinsert.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static BTStack NBTS_FUNCTION(_bt_search_insert)(Relation rel,
+ BTInsertState insertstate);
+
+static OffsetNumber NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ nbts_call(_bt_compare, rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return nbts_call(_bt_search, rel, insertstate->itup_key,
+ &insertstate->buf, BT_WRITE, NULL);
+}
+
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0)
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ Assert(P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0);
+
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
+
+#endif /* ifndef NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = nbts_call(_bt_mkscankey, rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = nbts_call(_bt_search_insert, rel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = nbts_call(_bt_findinsertloc, rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+ itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 20adb602a4..e66299ebd8 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1967,10 +1967,10 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, targetkey);
+ itup_key = nbts_call(_bt_mkscankey, rel, targetkey);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
- stack = _bt_search(rel, itup_key, &sleafbuf, BT_READ, NULL);
+ stack = nbts_call(_bt_search, rel, itup_key, &sleafbuf, BT_READ, NULL);
/* won't need a second lock or pin on leafbuf */
_bt_relbuf(rel, sleafbuf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 06131f23d4..09c43eb226 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,10 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -178,33 +182,6 @@ btbuildempty(Relation index)
smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
diff --git a/src/backend/access/nbtree/nbtree_spec.h b/src/backend/access/nbtree/nbtree_spec.h
new file mode 100644
index 0000000000..4c342287f6
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.h
@@ -0,0 +1,50 @@
+/*
+ * Specialized functions for nbtree.c
+ */
+
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
+void
+NBTS_FUNCTION(_bt_specialize)(Relation rel)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbts_call_norel(_bt_specialize, rel, rel);
+#else
+ rel->rd_indam->aminsert = NBTS_FUNCTION(btinsert);
+#endif
+}
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbts_call_norel(_bt_specialize, rel, rel);
+
+ return nbts_call(btinsert, rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexUnchanged, indexInfo);
+#else
+ bool result;
+ IndexTuple itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = nbts_call(_bt_doinsert, rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+#endif
+}
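(Editorial note, not part of the patch.) To make the runtime flow concrete: the default btinsert() above lets _bt_specialize() overwrite rel->rd_indam->aminsert with the per-shape btinsert variant and then forwards the current call through nbts_call(), so later inserts on the same relcache entry presumably reach the specialized function directly and the shape dispatch is not paid per tuple. The definition of nbts_call()/nbts_call_norel() is not visible in this excerpt; below is one plausible sketch, dispatching only on the number of key columns for brevity (the real patchset distinguishes more shapes, per 0003/0004):

/* Sketch only: one plausible expansion of nbts_call()/nbts_call_norel() */
#define nbts_call_norel(name, rel, ...) \
    (IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? \
     NBTS_MAKE_NAME(name, single)(__VA_ARGS__) : \
     NBTS_MAKE_NAME(name, multi)(__VA_ARGS__))

/* Most callees also take the relation as their first argument */
#define nbts_call(name, rel, ...) \
    nbts_call_norel(name, rel, rel, __VA_ARGS__)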
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c74543bfde..e81eee9c35 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,11 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -46,6 +43,9 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* _bt_drop_lock_and_maybe_pin()
@@ -70,493 +70,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- */
-BTStack
-_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
- Snapshot snapshot)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- */
-Buffer
-_bt_moveright(Relation rel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- Snapshot snapshot)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- page = BufferGetPage(buf);
- TestForOldSnapshot(snapshot, rel, page);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- break;
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
- {
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- break;
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- high = mid;
- }
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- {
- high = mid;
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
/*----------
* _bt_binsrch_posting() -- posting list binary search.
@@ -625,217 +138,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- return result;
-
- scankey++;
- }
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
-
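(Editorial note, not part of the patch.) The function removed by this hunk, _bt_compare(), is the main target of the specialization: its loop above calls index_getattr() once per key column, and when attcacheoff cannot be used each of those calls has to re-walk all preceding attributes. The iterator idea from 0002/0004 instead carries one running offset across the key columns. The following is only a minimal sketch of that idea; the function name and the fixed-length, NOT NULL restriction are mine to keep it short, and the real code must also handle NULLs and varlena alignment the way index_getattr()/index_deform_tuple() already do.

#include "postgres.h"

#include "access/itup.h"
#include "access/nbtree.h"
#include "access/tupmacs.h"
#include "fmgr.h"

/*
 * Sketch only: compare the first ncmpkey columns of an index tuple against
 * an insertion scankey while tracking the attribute offset incrementally,
 * assuming every key column is fixed-length and NOT NULL.
 */
static int32
example_compare_fixedlen(TupleDesc itupdesc, IndexTuple itup,
                         ScanKey scankey, int ncmpkey)
{
    char   *tp = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
    Size    off = 0;

    for (int i = 0; i < ncmpkey; i++, scankey++)
    {
        Form_pg_attribute att = TupleDescAttr(itupdesc, i);
        Datum       datum;
        int32       result;

        /* keep a running offset instead of recomputing it per column */
        off = att_align_nominal(off, att->attalign);
        datum = fetchatt(att, tp + off);
        off += att->attlen;     /* fixed-length columns only */

        result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
                                                 scankey->sk_collation,
                                                 datum,
                                                 scankey->sk_argument));
        if (!(scankey->sk_flags & SK_BT_DESC))
            INVERT_COMPARE_RESULT(result);

        if (result != 0)
            return result;
    }

    return 0;                   /* all compared columns are equal */
}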
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -1363,7 +665,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
- stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
+ stack = nbts_call(_bt_search, rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
@@ -1392,7 +694,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = nbts_call(_bt_binsrch, rel, &inskey, buf);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1422,9 +724,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, offnum))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, offnum))
{
/*
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
@@ -1498,280 +800,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2014,7 +1042,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation,
+ scan, dir, P_FIRSTDATAKEY(opaque)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2116,7 +1145,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation, scan,
+ dir, PageGetMaxOffsetNumber(page)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2448,7 +1478,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, start))
{
/*
* There's no actually-matching data on this page. Try to advance to
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
new file mode 100644
index 0000000000..73d5370496
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -0,0 +1,994 @@
+/*
+ * Specialized functions for nbtsearch.c
+ */
+
+/*
+ * These functions are not exposed, so the copies that would be emitted for
+ * the "default" specialization could never be called and would draw
+ * unused-function warnings. Avoid generating that dead code (and the
+ * warnings) by skipping these functions when emitting the default variants;
+ * a tiny illustration follows the declarations below.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel, BTScanInsert key,
+ Buffer buf);
+static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
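(Editorial note, not part of the patch.) As a tiny, self-contained illustration of the warning the guard above avoids: a static function that is emitted in a translation unit where nothing calls it is exactly the unused-function case described in the comment.

/* Sketch only: a static definition with no caller in its translation unit */
static int
example_unused_helper(void)
{
    return 0;   /* never called => compilers warn with -Wunused-function */
}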
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_binsrch)(Relation rel,
+ BTScanInsert key,
+ Buffer buf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
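+
A quick aside for reviewers: the cmpval trick in the loop above is what lets one loop produce both "first key >= scankey" (nextkey = false) and "first key > scankey" (nextkey = true). Here is a standalone sketch of the same invariant with plain integers in place of _bt_compare(); all names below are illustrative, not from the patch:

#include <stdio.h>

/* three-way comparison: <0, 0, >0 when key is <, ==, > the element */
static int
cmp(int key, int elem)
{
	return (key > elem) - (key < elem);
}

/*
 * With cmpval = 1 this returns the first slot whose element is >= key
 * (nextkey = false); with cmpval = 0 it returns the first slot whose
 * element is > key (nextkey = true).
 */
static int
binsrch(const int *slots, int nslots, int key, int cmpval)
{
	int		low = 0,
			high = nslots;

	while (high > low)
	{
		int		mid = low + (high - low) / 2;

		if (cmp(key, slots[mid]) >= cmpval)
			low = mid + 1;
		else
			high = mid;
	}
	return low;
}

int
main(void)
{
	int		slots[] = {10, 20, 20, 30};

	/* prints "1 3": first slot >= 20, then first slot > 20 */
	printf("%d %d\n", binsrch(slots, 4, 20, 1), binsrch(slots, 4, 20, 0));
	return 0;
}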
+
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* allow the next page to be processed by a parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * is safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = nbts_call(_bt_checkkeys, scan->indexRelation,
+ scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+#endif /* NBTS_SPECIALIZING_DEFAULT */
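+
One easy-to-misread detail in the backward-scan branch of _bt_readpage() above: itemIndex is decremented before each posting-list TID is stored, yet the TIDs still come out in ascending heap TID order when the scan later walks items[] from the top down, which is the ordering _bt_killitems() depends on. A minimal standalone model of just that indexing (illustrative values, not the patch's code):

#include <stdio.h>

#define MAXITEMS 8

int
main(void)
{
	int		posting[] = {101, 102, 103};	/* ascending heap TIDs in one posting tuple */
	int		items[MAXITEMS];
	int		itemIndex = MAXITEMS;			/* backward scans fill items[] from the top */

	/* mirrors the backward-scan posting-list branch of _bt_readpage() */
	itemIndex--;
	items[itemIndex] = posting[0];
	for (int i = 1; i < 3; i++)
	{
		itemIndex--;
		items[itemIndex] = posting[i];
	}

	/* a backward scan consumes items[] from the highest index downwards */
	for (int i = MAXITEMS - 1; i >= itemIndex; i--)
		printf("%d ", items[i]);			/* prints 101 102 103 */
	printf("\n");
	return 0;
}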
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ */
+BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
+ int access, Snapshot snapshot)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP,
+ (access == BT_WRITE), stack_in,
+ page_access, snapshot);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = nbts_call(_bt_binsrch, rel, key, *bufP);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is the lowest non-leaf level, just above the leaves. So,
+ * if we're on level 1 and asked to lock the leaf page in write mode, then
+ * lock the next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP, true, stack_in,
+ BT_WRITE, snapshot);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split
+ * is completed. 'stack' is only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ */
+Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ Snapshot snapshot)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ break;
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY) >= cmpval)
+ {
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ break;
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
+
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ {
+ high = mid;
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
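+
To make the bounds-caching contract above concrete, here is a small self-contained sketch with illustrative names; note that the real function maintains stricthigh during the same binary-search pass, while this sketch computes it with a second search purely for brevity:

#include <stdio.h>

static int	steps;

/* return the first slot in [low, high) whose value is >= key */
static int
search(const int *vals, int low, int high, int key)
{
	while (high > low)
	{
		int		mid = low + (high - low) / 2;

		steps++;
		if (vals[mid] < key)
			low = mid + 1;
		else
			high = mid;
	}
	return low;
}

int
main(void)
{
	int		page[] = {10, 20, 20, 30, 40, 50, 60, 70};
	int		low,
			stricthigh,
			offset;

	/* initial binary search against the whole page */
	low = search(page, 0, 8, 20);			/* 1 */
	stricthigh = search(page, low, 8, 21);	/* 3: first slot whose value is > 20 */

	/* a later search for the same key restarts from the cached bounds */
	steps = 0;
	offset = search(page, low, stricthigh, 20);
	printf("offset %d found in %d step(s)\n", offset, steps);
	return 0;
}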
+
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which the caller must handle
+ * itself (e.g., by splitting the posting list tuple).
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+NBTS_FUNCTION(_bt_compare)(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+ scankey = key->scankeys;
+ for (int i = 1; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ return result;
+
+ scankey++;
+ }
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
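+
The NBTS_FUNCTION()/nbts_call() names used throughout this file come from access/nbtree_specialize.h, which is not included in this excerpt. As a rough standalone model of the pattern, under the assumption that the header emits token-pasted per-shape variants from one body and dispatches on properties of the index relation (everything below is illustrative, not the patch's actual macros):

#include <stdio.h>

#define SPEC_NAME(name, shape)	name##_##shape

/*
 * One "template" body, emitted once per key shape.  The real patch does the
 * equivalent by #including NBT_SPECIALIZE_FILE several times with different
 * macro settings.
 */
#define EMIT_SUM(shape, step)								\
static int													\
SPEC_NAME(sum, shape)(const int *attrs, int nattrs)			\
{															\
	int		total = 0;										\
															\
	for (int i = 0; i < nattrs; i += (step))				\
		total += attrs[i];									\
	return total;											\
}

EMIT_SUM(cached, 1)		/* e.g. keys with cacheable offsets */
EMIT_SUM(uncached, 2)	/* e.g. keys needing per-attribute iteration */

/* dispatcher, comparable in spirit to nbts_call() picking a variant */
static int
sum_dispatch(const int *attrs, int nattrs, int shape)
{
	return shape == 0 ? SPEC_NAME(sum, cached)(attrs, nattrs)
					  : SPEC_NAME(sum, uncached)(attrs, nattrs);
}

int
main(void)
{
	int		attrs[] = {1, 2, 3, 4};

	printf("%d %d\n", sum_dispatch(attrs, 4, 0), sum_dispatch(attrs, 4, 1));
	return 0;	/* prints "10 4" */
}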
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 9f60fa9894..f1d146ba71 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,9 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* btbuild() -- build a new btree index.
@@ -566,7 +567,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+ wstate.inskey = nbts_call(_bt_mkscankey, wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
@@ -578,7 +579,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
PROGRESS_BTREE_PHASE_LEAF_LOAD);
- _bt_load(&wstate, btspool, btspool2);
+ nbts_call_norel(_bt_load, wstate.index, &wstate, btspool, btspool2);
}
/*
@@ -978,8 +979,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
lastleft = (IndexTuple) PageGetItem(opage, ii);
Assert(IndexTupleSize(oitup) > last_truncextra);
- truncated = _bt_truncate(wstate->index, lastleft, oitup,
- wstate->inskey);
+ truncated = nbts_call(_bt_truncate, wstate->index, lastleft, oitup,
+ wstate->inskey);
if (!PageIndexTupleOverwrite(opage, P_HIKEY, (Item) truncated,
IndexTupleSize(truncated)))
elog(ERROR, "failed to add high key to the index page");
@@ -1176,264 +1177,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
new file mode 100644
index 0000000000..8f4a3602ca
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -0,0 +1,275 @@
+/*
+ * Specialized functions included in nbtsort.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static void NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2);
+
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (nbts_call(_bt_keep_natts_fast, wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
+
+#endif
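+
A side note on the merge branch of _bt_load() above: the load1 decision is an ordinary two-way merge in which key ties are broken on the heap TID, so the final comparison never reports a genuine tie. A compact standalone model with illustrative types and values, not the patch's code:

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
	int		key;
	int		tid;		/* stands in for the heap TID tiebreaker */
} Tup;

static int
tup_cmp(const Tup *a, const Tup *b)
{
	if (a->key != b->key)
		return (a->key > b->key) - (a->key < b->key);
	return (a->tid > b->tid) - (a->tid < b->tid);
}

int
main(void)
{
	Tup		spool1[] = {{1, 5}, {3, 1}};
	Tup		spool2[] = {{1, 2}, {2, 9}};
	int		i = 0,
			j = 0;

	while (i < 2 || j < 2)
	{
		/* take from spool1 unless spool2's next tuple sorts strictly lower */
		bool	load1 = (j >= 2) ||
						(i < 2 && tup_cmp(&spool1[i], &spool2[j]) <= 0);

		if (load1)
		{
			printf("(%d,%d) ", spool1[i].key, spool1[i].tid);
			i++;
		}
		else
		{
			printf("(%d,%d) ", spool2[j].key, spool2[j].tid);
			j++;
		}
	}
	printf("\n");		/* prints (1,2) (1,5) (2,9) (3,1) */
	return 0;
}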
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 241e26d338..8e5337cad7 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -692,7 +692,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
{
itemid = PageGetItemId(state->origpage, maxoff);
tup = (IndexTuple) PageGetItem(state->origpage, itemid);
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -723,7 +723,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -972,7 +972,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* avoid appending a heap TID in new high key, we're done. Finish split
* with default strategy and initial split interval.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
return perfectpenalty;
@@ -993,7 +993,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* If page is entirely full of duplicates, a single value strategy split
* will be performed.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
{
*strategy = SPLIT_MANY_DUPLICATES;
@@ -1031,8 +1031,8 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
itemid = PageGetItemId(state->origpage, P_HIKEY);
hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
- perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
- state->newitem);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, hikey,
+ state->newitem);
if (perfectpenalty <= indnkeyatts)
*strategy = SPLIT_SINGLE_VALUE;
else
@@ -1154,7 +1154,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
lastleft = _bt_split_lastleft(state, split);
firstright = _bt_split_firstright(state, split);
- return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+ return nbts_call(_bt_keep_natts_fast, state->rel, lastleft, firstright);
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index ff260c393a..bc443ebd27 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,11 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1221,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2173,286 +1704,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
new file mode 100644
index 0000000000..a4b934ae7a
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -0,0 +1,772 @@
+/*
+ * Specialized functions included in nbtutils.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static bool NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+
+static int NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey, IndexTuple tuple,
+ int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == nbts_call(_bt_keep_natts_fast, rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+#endif /* NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (nbts_call_norel(_bt_check_rowcompare, rel, key, tuple,
+ tupnatts, tupdesc, dir, continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = nbts_call(_bt_keep_natts, rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
+ IndexTuple lastleft,
+ IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
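To make the keepnatts arithmetic above concrete, here is a tiny standalone sketch (a hypothetical keep_natts over arrays of C strings, not PostgreSQL code) of the rule both _bt_keep_natts and _bt_keep_natts_fast implement: keep key attributes up to and including the first attribute where lastleft and firstright differ, returning nkeyatts + 1 when no attribute distinguishes them.

#include <stdio.h>
#include <string.h>

/* keep attributes up to and including the first unequal one */
static int
keep_natts(const char **lastleft, const char **firstright, int nkeyatts)
{
	int		keepnatts = 1;

	for (int attnum = 1; attnum <= nkeyatts; attnum++)
	{
		if (strcmp(lastleft[attnum - 1], firstright[attnum - 1]) != 0)
			break;
		keepnatts++;
	}
	return keepnatts;
}

int
main(void)
{
	const char *lastleft[] = {"Smith", "Alice", "Q"};
	const char *firstright[] = {"Smith", "Bob", "A"};

	/*
	 * Attribute 1 is equal and attribute 2 differs, so a pivot needs only
	 * the first two attributes; this prints 2.
	 */
	printf("%d\n", keep_natts(lastleft, firstright, 3));
	return 0;
}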
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 1174e1a31c..816165217e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1122,7 +1122,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = nbts_call(_bt_mkscankey, indexRel, NULL);
if (state->indexInfo->ii_Expressions != NULL)
{
@@ -1220,7 +1220,7 @@ tuplesort_begin_index_btree(Relation heapRel,
state->enforceUnique = enforceUnique;
state->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = nbts_call(_bt_mkscankey, indexRel, NULL);
/* Prepare SortSupport data for each column */
state->sortKeys = (SortSupport) palloc0(state->nKeys *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 93f8267b48..83e0dbab16 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1116,15 +1116,47 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+
+
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
+
+#define NBTS_ENABLED
+
+#ifdef NBTS_ENABLED
+
+/*
+ * Access a specialized nbtree function, based on the shape of the index key.
+ */
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) \
+( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+)
+
+#else /* not defined NBTS_ENABLED */
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
+
+#endif /* NBTS_ENABLED */
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specialized.h"
+#include "nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1155,9 +1187,6 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
- IndexTuple newitem, Size newitemsz,
- bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1173,9 +1202,6 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
@@ -1223,12 +1249,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
- int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -1237,7 +1257,6 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1245,8 +1264,6 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1259,10 +1276,6 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
new file mode 100644
index 0000000000..23fdda4f0e
--- /dev/null
+++ b/src/include/access/nbtree_specialize.h
@@ -0,0 +1,204 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_specialize.h
+ * header file for postgres btree access method implementation.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_specialize.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_call(function, indexrel, ...rest_of_args), and
+ * nbts_call_norel(function, indexrel, ...args)
+ * This will call the specialized variant of 'function' based on the index
+ * relation data.
+ * The difference between the two is that nbts_call passes indexrel as the
+ * first argument of the function call, whereas nbts_call_norel does not.
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ *
+ * example usage:
+ *
+ * kwithnulls = nbts_call_norel(_bt_key_hasnulls, myindex, mytuple, tupDesc);
+ *
+ * static bool
+ * NBTS_FUNCTION(_bt_key_hasnulls)(IndexTuple mytuple, TupleDesc tupDesc)
+ * {
+ * nbts_attiterdeclare(mytuple);
+ * nbts_attiterinit(mytuple, 1, tupDesc);
+ * nbts_foreachattr(1, 10)
+ * {
+ * Datum it = nbts_attiter_nextattdatum(mytuple, tupDesc);
+ * if (nbts_attiter_curattisnull(mytuple))
+ * return true;
+ * }
+ * return false;
+ * }
+ */
+
+/*
+ * Call a potentially specialized function for a given btree operation.
+ *
+ * NB: the rel argument is evaluated multiple times.
+ */
+#define nbts_call(name, rel, ...) \
+ nbts_call_norel(name, (rel), (rel), __VA_ARGS__)
+
+#ifdef NBTS_ENABLED
+
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+#ifdef nbts_call_norel
+#undef nbts_call_norel
+#endif
+
+#define nbts_call_norel(name, rel, ...) \
+ (NBTS_FUNCTION(name)(__VA_ARGS__))
+
+/*
+ * Multiple key columns, optimized access for attcacheoff-cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_CACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* reset call to SPECIALIZE_CALL for default behaviour */
+#undef nbts_call_norel
+#define nbts_call_norel(name, rel, ...) \
+ NBT_SPECIALIZE_CALL(name, (rel), __VA_ARGS__)
+
+/*
+ * "Default", externally accessible, not so much optimized functions
+ */
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+/* for the default functions, we want to use the unspecialized name. */
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) name
+
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* from here on there are no more NBTS_FUNCTIONs */
+#undef NBTS_FUNCTION
+
+#else /* not defined NBTS_ENABLED */
+
+/*
+ * NBTS_ENABLED is not defined, so we don't want to use the specializations.
+ * We revert to the behaviour from PG14 and earlier, which only uses
+ * attcacheoff.
+ */
+
+#define NBTS_FUNCTION(name) name
+
+#define nbts_call_norel(name, rel, ...) \
+ name(__VA_ARGS__)
+
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+
+#endif /* !NBTS_ENABLED */
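For readers unfamiliar with this preprocessor trick: the header above emits one copy of each specialized function per key shape by re-including NBT_SPECIALIZE_FILE with different macro definitions. A minimal standalone sketch of the same pattern, using hypothetical names (MAKE_NAME, SPEC_FUNCTION, SHAPE, sum_*) and a macro body instead of a separately included file, looks roughly like this:

#include <stdio.h>

/* two-level concatenation so SHAPE is expanded before pasting */
#define MAKE_NAME_(a, b) a##b
#define MAKE_NAME(a, b) MAKE_NAME_(a, b)
#define SPEC_FUNCTION(name) MAKE_NAME(name, SHAPE)

/* the "specializable" body; the patch keeps this in NBT_SPECIALIZE_FILE */
#define TEMPLATE_BODY \
static int SPEC_FUNCTION(sum_)(const int *vals, int n) \
{ \
	int total = 0; \
	for (int i = 0; i < n; i++) \
		total += vals[i]; \
	return total; \
}

/* emit sum_cached() */
#define SHAPE cached
TEMPLATE_BODY
#undef SHAPE

/* emit sum_default() */
#define SHAPE default
TEMPLATE_BODY
#undef SHAPE

int
main(void)
{
	int		v[] = {1, 2, 3};

	/* both variants exist as separate functions; prints "6 6" */
	printf("%d %d\n", sum_cached(v, 3), sum_default(v, 3));
	return 0;
}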
diff --git a/src/include/access/nbtree_specialized.h b/src/include/access/nbtree_specialized.h
new file mode 100644
index 0000000000..c45fa84aed
--- /dev/null
+++ b/src/include/access/nbtree_specialized.h
@@ -0,0 +1,67 @@
+/*
+ * prototypes for functions that are included in nbtree.h
+ */
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_specialize)(Relation rel);
+
+extern bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key,
+ Buffer *bufP, int access,
+ Snapshot snapshot);
+extern Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel, BTScanInsert key, Buffer buf,
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot);
+extern OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate);
+extern int32
+NBTS_FUNCTION(_bt_compare)(Relation rel, BTScanInsert key,
+ Page page, OffsetNumber offnum);
+
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup);
+extern bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
--
2.30.2
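To show how the nbts_attiter* macros are meant to be used at call sites such as _bt_compare (updated in the v3-0002 patch below), here is a standalone sketch with made-up toy_* names and a trivial ToyTuple struct. Only the declare/init/foreach shape matters; the toy macro bodies are not the patch's implementations.

#include <stdbool.h>
#include <stdio.h>

/* toy "index tuple": fixed-width int keys plus a null bitmap */
typedef struct ToyTuple
{
	int		values[4];
	bool	nulls[4];
} ToyTuple;

/* toy analogues of the nbts_attiter* macros, specialized for ToyTuple */
#define toy_attiterdeclare(itup)		bool itup##_isNull
#define toy_attiterinit(itup, attno)	((void) 0)
#define toy_foreachattr(init, end) \
	for (int toy_i = (init); toy_i <= (end); toy_i++)
#define toy_attiter_nextattdatum(itup) \
	(itup##_isNull = (itup)->nulls[toy_i - 1], (itup)->values[toy_i - 1])
#define toy_attiter_curattisnull(itup)	itup##_isNull

/* compare the first nkeys attributes, NULLs sorting last */
static int
toy_compare(const ToyTuple *a, const ToyTuple *b, int nkeys)
{
	toy_attiterdeclare(a);
	toy_attiterdeclare(b);

	toy_attiterinit(a, 1);
	toy_attiterinit(b, 1);

	toy_foreachattr(1, nkeys)
	{
		int		da = toy_attiter_nextattdatum(a);
		int		db = toy_attiter_nextattdatum(b);

		if (toy_attiter_curattisnull(a) || toy_attiter_curattisnull(b))
		{
			if (toy_attiter_curattisnull(a) != toy_attiter_curattisnull(b))
				return toy_attiter_curattisnull(a) ? 1 : -1;
			continue;			/* NULL "=" NULL */
		}
		if (da != db)
			return (da < db) ? -1 : 1;
	}
	return 0;
}

int
main(void)
{
	ToyTuple	t1 = {{1, 2, 0, 0}, {false, false, true, true}};
	ToyTuple	t2 = {{1, 3, 0, 0}, {false, false, true, true}};

	/* first key equal, second differs (2 < 3): prints -1 */
	printf("%d\n", toy_compare(&t1, &t2, 2));
	return 0;
}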
v3-0002-Use-specialized-attribute-iterators-in-backend-nb.patch (application/x-patch)
From 8cc1ea41353ba0d0d69a6383c75ed2663608d609 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:30:00 +0200
Subject: [PATCH v3 2/9] Use specialized attribute iterators in
backend/*/nbt*_spec.h
Split out to make it clear what substantial changes were made to the
pre-existing functions.
Even though not all nbt*_spec functions have been updated, most call sites
can now call the specialized functions directly instead of having to determine
the right specialization from the (potentially locally unavailable) index
relation, which makes specializing those functions worth the effort.
---
src/backend/access/nbtree/nbtsearch_spec.h | 16 +++---
src/backend/access/nbtree/nbtsort_spec.h | 24 +++++----
src/backend/access/nbtree/nbtutils_spec.h | 63 +++++++++++++---------
3 files changed, 62 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
index 73d5370496..a5c5f2b94f 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.h
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -823,6 +823,7 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -854,23 +855,26 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
index 8f4a3602ca..d3f2db2dc4 100644
--- a/src/backend/access/nbtree/nbtsort_spec.h
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -27,8 +27,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -50,7 +49,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -82,22 +81,25 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
}
else if (itup != NULL)
{
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
int32 compare = 0;
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
index a4b934ae7a..638eff18f6 100644
--- a/src/backend/access/nbtree/nbtutils_spec.h
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -211,6 +211,8 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
@@ -222,20 +224,22 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -243,6 +247,7 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -295,7 +300,7 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
- int i;
+ nbts_attiterdeclare(itup);
itupdesc = RelationGetDescr(rel);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -326,7 +331,10 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -337,27 +345,30 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -744,24 +755,28 @@ NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft,itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) !=
+ nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
--
2.30.2
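For readers unfamiliar with the macro-redefinition approach used above, here
is a minimal standalone sketch (not PostgreSQL code; all names are invented)
of the same technique: one shared loop body is compiled once per "key shape"
by redefining the iteration macros before instantiating the body again. The
real patches re-#include a *_spec.h file instead of expanding a
function-generating macro, but the mechanism is the same.

/*
 * Standalone sketch of the nbts_* iteration-macro technique.
 */
#include <stdio.h>

/* shared body, parameterized by the FOREACH/GETVAL "iterator" macros */
#define SUM_FUNC(suffix) \
	static int sum_##suffix(const int *vals, int n) \
	{ \
		int total = 0; \
		FOREACH(n) \
			total += GETVAL(vals); \
		return total; \
	}

/* shape 1: generic iteration over all attributes */
#define FOREACH(n)	for (int i = 0; i < (n); i++)
#define GETVAL(vals)	((vals)[i])
SUM_FUNC(generic)
#undef FOREACH
#undef GETVAL

/* shape 2: single-attribute fast path, like the single-column case */
#define FOREACH(n)	for (int i = 0; i < 1 && (n) >= 1; i++)
#define GETVAL(vals)	((vals)[0])
SUM_FUNC(single)
#undef FOREACH
#undef GETVAL

int
main(void)
{
	int			vals[] = {1, 2, 3};

	printf("%d %d\n", sum_generic(vals, 3), sum_single(vals, 1));
	return 0;
}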
Attachment: v3-0004-Optimize-attribute-iterator-access-for-single-col.patch
From 38f619c26422ff8b58f6f70478ba110401f53ee0 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:47:50 +0200
Subject: [PATCH v3 4/9] Optimize attribute iterator access for single-column
btree keys
This removes the index_getattr_nocache call path, which has significant overhead.
---
src/include/access/nbtree.h | 9 ++++-
src/include/access/nbtree_specialize.h | 56 ++++++++++++++++++++++++++
2 files changed, 64 insertions(+), 1 deletion(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 489b623663..1559399b0e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1120,6 +1120,7 @@ typedef struct BTOptions
/*
* Macros used in the nbtree specialization code.
*/
+#define NBTS_TYPE_SINGLE_COLUMN single
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
@@ -1151,7 +1152,13 @@ do { \
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
- NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
)
#else /* not defined NBTS_ENABLED */
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index 23fdda4f0e..642bc4c795 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -79,6 +79,62 @@
#define nbts_call_norel(name, rel, ...) \
(NBTS_FUNCTION(name)(__VA_ARGS__))
+/*
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path must never be used for indexes with multiple key
+ * columns, because it never advances to the next column.
+ */
+
+#define NBTS_SPECIALIZING_SINGLE_COLUMN
+#define NBTS_TYPE NBTS_TYPE_SINGLE_COLUMN
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert((endAttNum) == 1); \
+ if ((initAttNum) == 1) for (int spec_i = 0; spec_i < 1; spec_i++)
+
+#define nbts_attiter_attnum 1
+
+/*
+ * Simplified (optimized) variant of index_getattr specialized for extracting
+ * only the first attribute: cache offset is guaranteed to be 0, and as such
+ * no cache is required.
+ */
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+( \
+ AssertMacro(spec_i == 0), \
+ (IndexTupleHasNulls(itup) && att_isnull(0, (char *)(itup) + sizeof(IndexTupleData))) ? \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull)) = true, \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull) = false), \
+ (Datum) fetchatt(TupleDescAttr((tupDesc), 0), \
+ (char *) (itup) + IndexInfoFindDataOffset((itup)->t_info)) \
+ ) \
+)
+
+#define nbts_attiter_curattisnull(tuple) \
+ NBTS_MAKE_NAME(tuple, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_COLUMN
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* Multiple key columns, optimized access for attcacheoff -cacheable offsets.
*/
--
2.30.2
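To make the dispatch in NBT_SPECIALIZE_CALL more concrete, here is a minimal
standalone sketch (not PostgreSQL code; the names and functions are invented)
of selecting a specialized variant at the call site based on the number of
key attributes.

/*
 * Standalone sketch of shape-based dispatch to specialized functions.
 */
#include <stdio.h>

static int
cmp_single(const int *key, const int *tup)
{
	/* single-column fast path: only the first attribute matters */
	return key[0] - tup[0];
}

static int
cmp_cached(const int *key, const int *tup)
{
	/* generic path: compare all (here: two) key columns in order */
	for (int i = 0; i < 2; i++)
		if (key[i] != tup[i])
			return key[i] - tup[i];
	return 0;
}

/* dispatch macro: pick the specialization from the "index shape" */
#define SPECIALIZE_CALL(func, nkeyatts, ...) \
	((nkeyatts) == 1 ? func##_single(__VA_ARGS__) : func##_cached(__VA_ARGS__))

int
main(void)
{
	int			key[] = {4, 2};
	int			tup[] = {4, 7};

	printf("%d\n", SPECIALIZE_CALL(cmp, 2, key, tup));	/* -5 */
	printf("%d\n", SPECIALIZE_CALL(cmp, 1, key, tup));	/* 0 */
	return 0;
}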
Attachment: v3-0005-Add-a-function-whose-task-it-is-to-populate-all-a.patch
From 7cd7335a8e0c0540835a738ad92dc6c214ebe566 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:51:01 +0200
Subject: [PATCH v3 5/9] Add a function whose task it is to populate all
attcacheoff-s of a TupleDesc's attributes
It fills uncacheable offsets with -2, as opposed to -1, which signals
"unknown". This allows users of the API to determine the cacheability of
an attribute in O(1) after this one-time O(n) cost, rather than paying
the repeated O(n) cost that currently applies.
---
src/backend/access/common/tupdesc.c | 97 +++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 99 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 9f41b1e854..5630fc9da0 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -910,3 +910,100 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc's attributes, returning the
+ * attribute number of the last attribute with a cacheable offset.
+ *
+ * Sets attcacheoff to -2 for uncacheable attributes (i.e. attributes that
+ * follow a variable-length attribute).
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber i, j;
+
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ {
+ /*
+ * Already done the calculations, find the last attribute that has
+ * cache offset.
+ */
+ for (i = (AttrNumber) numberOfAttributes; i > 1; i--)
+ {
+ if (TupleDescAttr(desc, i - 1)->attcacheoff != -2)
+ return i;
+ }
+
+ return 1;
+ }
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ i = 1;
+ /*
+ * Someone might have set some offsets previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (i < numberOfAttributes && TupleDescAttr(desc, i)->attcacheoff > 0)
+ i++;
+
+ /* Cache offset is undetermined. Start calculating offsets if possible */
+ if (i < numberOfAttributes &&
+ TupleDescAttr(desc, i)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, i - 1);
+ Size off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (i < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, i);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ i++;
+ }
+ } else {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ }
+ }
+
+ /*
+ * No cacheable offsets left. Fill the rest with -2s, and return the number
+ * of the last attribute with a cached offset.
+ */
+ j = i;
+
+ while (i < numberOfAttributes)
+ {
+ TupleDescAttr(desc, i)->attcacheoff = -2;
+ i++;
+ }
+
+ return j;
+}
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index 28dd6de18b..219f837875 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.30.2
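The marking scheme of PopulateTupleDescCacheOffsets can be illustrated with a
small standalone sketch (not PostgreSQL code; alignment padding is ignored
for brevity): offsets stay cached up to and including the first
variable-length attribute, and everything after it is marked -2, so callers
can test cacheability with a single O(1) comparison.

/*
 * Standalone sketch of the -1 / -2 / >= 0 attcacheoff marking scheme.
 */
#include <stdio.h>

struct attr
{
	int			attlen;			/* > 0: fixed width; -1: variable length */
	int			attcacheoff;	/* -1 = unknown, -2 = uncacheable, >= 0 = offset */
};

static void
populate_cache_offsets(struct attr *atts, int natts)
{
	int			off = 0;

	for (int i = 0; i < natts; i++)
	{
		if (off < 0)
		{
			/* an earlier varlena attribute makes this offset unknowable */
			atts[i].attcacheoff = -2;
			continue;
		}
		atts[i].attcacheoff = off;
		off = (atts[i].attlen > 0) ? off + atts[i].attlen : -1;
	}
}

int
main(void)
{
	struct attr atts[] = {{4, -1}, {8, -1}, {-1, -1}, {4, -1}};

	populate_cache_offsets(atts, 4);
	for (int i = 0; i < 4; i++)
		printf("attr %d: attcacheoff = %d\n", i + 1, atts[i].attcacheoff);
	/* prints 0, 4, 12, -2: only the attribute after the varlena is uncacheable */
	return 0;
}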
Attachment: v3-0008-Add-specialization-to-btree-index-creation.patch
From 8caee4fa759430b4f1ea5e6ca6cff8ea0f67bb80 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 21 Apr 2022 16:22:07 +0200
Subject: [PATCH v3 8/9] Add specialization to btree index creation.
This was an easily corrected oversight, but an oversight nonetheless.
It increases index (re)build performance by another few percent.
---
src/backend/utils/sort/tuplesort.c | 147 ++---------------------
src/backend/utils/sort/tuplesort_nbts.h | 148 ++++++++++++++++++++++++
src/include/access/nbtree.h | 18 +++
3 files changed, 175 insertions(+), 138 deletions(-)
create mode 100644 src/backend/utils/sort/tuplesort_nbts.h
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 816165217e..8adc734f1c 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -651,8 +651,6 @@ static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
SortTuple *stup);
static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
LogicalTape *tape, unsigned int len);
-static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -675,6 +673,10 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
static void tuplesort_free(Tuplesortstate *state);
static void tuplesort_updatemax(Tuplesortstate *state);
+#define NBT_SPECIALIZE_FILE "../../backend/utils/sort/tuplesort_nbts.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
/*
* Specialized comparators that we can inline into specialized sorts. The goal
* is to try to sort two tuples without having to follow the pointers to the
@@ -1208,7 +1210,7 @@ tuplesort_begin_index_btree(Relation heapRel,
sortopt & TUPLESORT_RANDOMACCESS,
PARALLEL_SORT(state));
- state->comparetup = comparetup_index_btree;
+ state->comparetup = NBT_SPECIALIZE_NAME(comparetup_index_btree, indexRel);
state->copytup = copytup_index;
state->writetup = writetup_index;
state->readtup = readtup_index;
@@ -1326,7 +1328,7 @@ tuplesort_begin_index_gist(Relation heapRel,
state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
- state->comparetup = comparetup_index_btree;
+ state->comparetup = NBT_SPECIALIZE_NAME(comparetup_index_btree, indexRel);
state->copytup = copytup_index;
state->writetup = writetup_index;
state->readtup = readtup_index;
@@ -4292,142 +4294,11 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
* The btree and hash cases require separate comparison functions, but the
* IndexTuple representation is the same so the copy/write/read support
* functions can be shared.
+ *
+ * The nbtree function can be found in tuplesort_nbts.h, and is included
+ * through the nbtree specialization infrastructure.
*/
-static int
-comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- /*
- * This is similar to comparetup_heap(), but expects index tuples. There
- * is also special handling for enforcing uniqueness, and special
- * treatment for equal keys at the end.
- */
- SortSupport sortKey = state->sortKeys;
- IndexTuple tuple1;
- IndexTuple tuple2;
- int keysz;
- TupleDesc tupDes;
- bool equal_hasnull = false;
- int nkey;
- int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
-
- /* Compare the leading sort key */
- compare = ApplySortComparator(a->datum1, a->isnull1,
- b->datum1, b->isnull1,
- sortKey);
- if (compare != 0)
- return compare;
-
- /* Compare additional sort keys */
- tuple1 = (IndexTuple) a->tuple;
- tuple2 = (IndexTuple) b->tuple;
- keysz = state->nKeys;
- tupDes = RelationGetDescr(state->indexRel);
-
- if (sortKey->abbrev_converter)
- {
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
- }
-
- /* they are equal, so we only need to examine one null flag */
- if (a->isnull1)
- equal_hasnull = true;
-
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
- {
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
-
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare; /* done when we find unequal attributes */
-
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
- equal_hasnull = true;
- }
-
- /*
- * If btree has asked us to enforce uniqueness, complain if two equal
- * tuples are detected (unless there was at least one NULL field and NULLS
- * NOT DISTINCT was not set).
- *
- * It is sufficient to make the test here, because if two tuples are equal
- * they *must* get compared at some stage of the sort --- otherwise the
- * sort algorithm wouldn't have checked whether one must appear before the
- * other.
- */
- if (state->enforceUnique && !(!state->uniqueNullsNotDistinct && equal_hasnull))
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- /*
- * Some rather brain-dead implementations of qsort (such as the one in
- * QNX 4) will sometimes call the comparison routine to compare a
- * value to itself, but we always use our own implementation, which
- * does not.
- */
- Assert(tuple1 != tuple2);
-
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(state->indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(state->indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(state->heapRel,
- RelationGetRelationName(state->indexRel))));
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is required for
- * btree indexes, since heap TID is treated as an implicit last key
- * attribute in order to ensure that all keys in the index are physically
- * unique.
- */
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
- /* ItemPointer values should never be equal */
- Assert(false);
-
- return 0;
-}
-
static int
comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state)
diff --git a/src/backend/utils/sort/tuplesort_nbts.h b/src/backend/utils/sort/tuplesort_nbts.h
new file mode 100644
index 0000000000..d1b2670747
--- /dev/null
+++ b/src/backend/utils/sort/tuplesort_nbts.h
@@ -0,0 +1,148 @@
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static int NBTS_FUNCTION(comparetup_index_btree)(const SortTuple *a,
+ const SortTuple *b,
+ Tuplesortstate *state);
+
+static int
+NBTS_FUNCTION(comparetup_index_btree)(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ /*
+ * This is similar to comparetup_heap(), but expects index tuples. There
+ * is also special handling for enforcing uniqueness, and special
+ * treatment for equal keys at the end.
+ */
+ SortSupport sortKey = state->sortKeys;
+ IndexTuple tuple1;
+ IndexTuple tuple2;
+ int keysz;
+ TupleDesc tupDes;
+ bool equal_hasnull = false;
+ int nkey;
+ int32 compare;
+ nbts_attiterdeclare(tuple1);
+ nbts_attiterdeclare(tuple2);
+
+ /* Compare the leading sort key */
+ compare = ApplySortComparator(a->datum1, a->isnull1,
+ b->datum1, b->isnull1,
+ sortKey);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ tuple1 = (IndexTuple) a->tuple;
+ tuple2 = (IndexTuple) b->tuple;
+ keysz = state->nKeys;
+ tupDes = RelationGetDescr(state->indexRel);
+
+ if (!sortKey->abbrev_converter)
+ {
+ nkey = 2;
+ sortKey++;
+ }
+ else
+ nkey = 1;
+
+ if (a->isnull1)
+ equal_hasnull = true;
+
+ nbts_attiterinit(tuple1, nkey, tupDes);
+ nbts_attiterinit(tuple2, nkey, tupDes);
+
+ nbts_foreachattr(nkey, keysz)
+ {
+ Datum datum1,
+ datum2;
+ datum1 = nbts_attiter_nextattdatum(tuple1, tupDes);
+ datum2 = nbts_attiter_nextattdatum(tuple2, tupDes);
+
+ if (nbts_attiter_attnum == 1)
+ {
+ compare = ApplySortAbbrevFullComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ }
+ else
+ {
+ compare = ApplySortComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ }
+
+ if (compare != 0)
+ return compare;
+
+ if (nbts_attiter_curattisnull(tuple1))
+ equal_hasnull = true;
+
+ sortKey++;
+ }
+
+ /*
+ * If btree has asked us to enforce uniqueness, complain if two equal
+ * tuples are detected (unless there was at least one NULL field and NULLS
+ * NOT DISTINCT was not set).
+ *
+ * It is sufficient to make the test here, because if two tuples are equal
+ * they *must* get compared at some stage of the sort --- otherwise the
+ * sort algorithm wouldn't have checked whether one must appear before the
+ * other.
+ */
+ if (state->enforceUnique && !(!state->uniqueNullsNotDistinct && equal_hasnull))
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ /*
+ * Some rather brain-dead implementations of qsort (such as the one in
+ * QNX 4) will sometimes call the comparison routine to compare a
+ * value to itself, but we always use our own implementation, which
+ * does not.
+ */
+ Assert(tuple1 != tuple2);
+
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(state->indexRel, values, isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(state->indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(state->heapRel,
+ RelationGetRelationName(state->indexRel))));
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is required for
+ * btree indexes, since heap TID is treated as an implicit last key
+ * attribute in order to ensure that all keys in the index are physically
+ * unique.
+ */
+ {
+ BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
+ BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
+
+ if (blk1 != blk2)
+ return (blk1 < blk2) ? -1 : 1;
+ }
+ {
+ OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
+ OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
+
+ if (pos1 != pos2)
+ return (pos1 < pos2) ? -1 : 1;
+ }
+
+ /* ItemPointer values should never be equal */
+ Assert(false);
+
+ return 0;
+}
+
+#endif
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 80f2575884..3e291c84fd 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1169,6 +1169,24 @@ do { \
) \
)
+#define NBT_SPECIALIZE_NAME(name, rel) \
+( \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_COLUMN) \
+ ) \
+ : \
+ ( \
+ TupleDescAttr(RelationGetDescr(rel), \
+ IndexRelationGetNumberOfKeyAttributes(rel) - 1)->attcacheoff > 0 ? ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_UNCACHED) \
+ ) \
+ ) \
+)
+
#else /* not defined NBTS_ENABLED */
#define nbt_opt_specialize(rel)
--
2.30.2
Attachment: v3-0007-Implement-dynamic-prefix-compression-in-nbtree.patch
From 4f8c9cb8af171fca3226ef9acb2883623576983d Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 15 Apr 2022 18:25:38 +0200
Subject: [PATCH v3 7/9] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, if some prefix of the key attributes of
the tuples on both sides of the tuple being compared is equal to the scankey,
then the compared tuple must share that same prefix with the scankey.
We cannot propagate this information to _binsrch on lower pages, as the
downstream page may concurrently have split and/or merged with its deleted
left neighbour (see [0]), which moves the keyspace of the linked page. We can
therefore only trust the state of the current page for this optimization,
which means we must re-establish that state each time we open a page.
Although this limits the overall gain, it still yields a nice performance
improvement in most cases where the leading columns contain many duplicate
values and the compare function is not cheap.
---
contrib/amcheck/verify_nbtree.c | 17 +++--
src/backend/access/nbtree/README | 25 ++++++++
src/backend/access/nbtree/nbtinsert.c | 14 ++--
src/backend/access/nbtree/nbtinsert_spec.h | 22 +++++--
src/backend/access/nbtree/nbtsearch.c | 2 +-
src/backend/access/nbtree/nbtsearch_spec.h | 75 +++++++++++++++++-----
src/include/access/nbtree_specialized.h | 8 ++-
7 files changed, 127 insertions(+), 36 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 70278c4f93..5753611546 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2673,6 +2673,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2682,13 +2683,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2750,6 +2751,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2760,7 +2762,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2812,10 +2814,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2835,10 +2838,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2873,13 +2877,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3c08888c23..13ac9ee2be 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,31 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because nbtree keyspaces are sorted, once we have determined that some
+leading columns of the tuples on both sides of the tuple being compared
+are equal to the scankey, the current tuple must also share this prefix
+with the scankey. This allows us to skip comparing those columns,
+potentially saving cycles.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so this is only useful on the page level: Concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by an inner page to the right. If we re-used high- and low-column-prefixes,
+we would not be able to detect a change of keyspace from e.g. (2,2) to (1,2),
+and subsequently return invalid results. This race condition can only be
+prevented by re-establishing the prefix-equal-columns for each page.
+
+The upside is that we already have a comparison result for the highest
+value on the page: the page's high key is compared to the scankey while we
+hold a pin on the page in the _bt_moveright procedure. The _bt_binsrch
+procedure uses this result as the rightmost prefix bound, and each step of
+the binary search (that does not compare less than the insert key) further
+improves the equal-prefix bounds.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ec6c73d1cc..20e5f33f98 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -132,7 +132,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ offset = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -142,6 +142,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -177,7 +179,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(nbts_call(_bt_compare, rel, itup_key, page, offset) < 0);
+ Assert(nbts_call(_bt_compare, rel, itup_key, page, offset,
+ &cmpcol) < 0);
break;
}
@@ -202,7 +205,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (nbts_call(_bt_compare, rel, itup_key, page, offset) != 0)
+ if (nbts_call(_bt_compare, rel, itup_key, page, offset,
+ &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -412,11 +416,13 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY);
+ highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY,
+ &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
diff --git a/src/backend/access/nbtree/nbtinsert_spec.h b/src/backend/access/nbtree/nbtinsert_spec.h
index 97c866aea3..ccba0fa5ed 100644
--- a/src/backend/access/nbtree/nbtinsert_spec.h
+++ b/src/backend/access/nbtree/nbtinsert_spec.h
@@ -73,6 +73,7 @@ NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber comparecol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -91,7 +92,8 @@ NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- nbts_call(_bt_compare, rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ nbts_call(_bt_compare, rel, insertstate->itup_key, page,
+ P_HIKEY, &comparecol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -221,6 +223,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
/*
* Does the new tuple belong on this page?
*
@@ -238,7 +241,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0)
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, insertstate, stack);
@@ -291,6 +294,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -321,7 +325,8 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) != 0 ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY,
+ &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -336,10 +341,13 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) || nbts_call(_bt_compare, rel, itup_key,
+ page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -358,7 +366,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d5152bfcb7..036ce88679 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -696,7 +696,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = nbts_call(_bt_binsrch, rel, &inskey, buf);
+ offnum = nbts_call(_bt_binsrch, rel, &inskey, buf, 1);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
index a5c5f2b94f..829c216819 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.h
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -10,8 +10,10 @@
*/
#ifndef NBTS_SPECIALIZING_DEFAULT
-static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel, BTScanInsert key,
- Buffer buf);
+static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ AttrNumber highkeycmpcol);
static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
@@ -38,7 +40,8 @@ static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
static OffsetNumber
NBTS_FUNCTION(_bt_binsrch)(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -46,6 +49,8 @@ NBTS_FUNCTION(_bt_binsrch)(Relation rel,
high;
int32 result,
cmpval;
+ AttrNumber highcmpcol = highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -87,15 +92,22 @@ NBTS_FUNCTION(_bt_binsrch)(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = nbts_call(_bt_compare, rel, key, page, mid);
+ result = nbts_call(_bt_compare, rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
/*
@@ -441,6 +453,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
IndexTuple itup;
BlockNumber child;
BTStack new_stack;
+ AttrNumber highkeycmpcol = 1;
/*
* Race -- the page we just grabbed may have split since we read its
@@ -456,7 +469,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
*/
*bufP = nbts_call(_bt_moveright, rel, key, *bufP,
(access == BT_WRITE), stack_in,
- page_access, snapshot);
+ page_access, snapshot, &highkeycmpcol);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -468,7 +481,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = nbts_call(_bt_binsrch, rel, key, *bufP);
+ offnum = nbts_call(_bt_binsrch, rel, key, *bufP, highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
@@ -507,6 +520,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ AttrNumber highkeycmpcol = 1;
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -517,7 +531,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
* move right to its new sibling. Do that.
*/
*bufP = nbts_call(_bt_moveright, rel, key, *bufP, true, stack_in,
- BT_WRITE, snapshot);
+ BT_WRITE, snapshot, &highkeycmpcol);
}
return stack_in;
@@ -565,12 +579,15 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
bool forupdate,
BTStack stack,
int access,
- Snapshot snapshot)
+ Snapshot snapshot,
+ AttrNumber *comparecol)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
+ Assert(PointerIsValid(comparecol));
+
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -592,12 +609,17 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = cmpcol;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -623,14 +645,19 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY) >= cmpval)
+ if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY,
+ &cmpcol) >= cmpval)
{
/* step right one page */
+ *comparecol = 1;
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -663,7 +690,8 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
* list split).
*/
OffsetNumber
-NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -673,6 +701,7 @@ NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -723,16 +752,21 @@ NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = nbts_call(_bt_compare, rel, key, page, mid);
+ result = nbts_call(_bt_compare, rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
if (result != 0)
stricthigh = high;
}
@@ -813,7 +847,8 @@ int32
NBTS_FUNCTION(_bt_compare)(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -854,10 +889,11 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- nbts_attiterinit(itup, 1, itupdesc);
- nbts_foreachattr(1, ncmpkey)
+ nbts_attiterinit(itup, *comparecol, itupdesc);
+ scankey = key->scankeys + ((*comparecol) - 1);
+
+ nbts_foreachattr(*comparecol, ncmpkey)
{
Datum datum;
@@ -902,11 +938,20 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = nbts_attiter_attnum;
return result;
+ }
scankey++;
}
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
diff --git a/src/include/access/nbtree_specialized.h b/src/include/access/nbtree_specialized.h
index c45fa84aed..7402a4c46e 100644
--- a/src/include/access/nbtree_specialized.h
+++ b/src/include/access/nbtree_specialized.h
@@ -43,12 +43,14 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key,
extern Buffer
NBTS_FUNCTION(_bt_moveright)(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access,
- Snapshot snapshot);
+ Snapshot snapshot, AttrNumber *comparecol);
extern OffsetNumber
-NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate);
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
extern int32
NBTS_FUNCTION(_bt_compare)(Relation rel, BTScanInsert key,
- Page page, OffsetNumber offnum);
+ Page page, OffsetNumber offnum,
+ AttrNumber *comparecol);
/*
* prototypes for functions in nbtutils_spec.h
--
2.30.2
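The binary-search bookkeeping in _bt_binsrch above can be summarized with a
standalone sketch (not PostgreSQL code; two integer columns, no NULLs): both
bounds remember how many leading columns they are already known to share with
the search key, and each comparison starts at the smaller of the two.

/*
 * Standalone sketch of dynamic prefix tracking in a lower_bound search.
 */
#include <stdio.h>

#define NCOLS 2

/*
 * Compare key and row starting at column *startcol (1-based); on return,
 * *startcol is the first column that differed (or NCOLS + 1 if none did).
 */
static int
cmp_from(const int *key, const int *row, int *startcol)
{
	for (int col = *startcol; col <= NCOLS; col++)
	{
		if (key[col - 1] != row[col - 1])
		{
			*startcol = col;
			return (key[col - 1] < row[col - 1]) ? -1 : 1;
		}
	}
	*startcol = NCOLS + 1;
	return 0;
}

/* find the index of the first row >= key */
static int
binsrch(int rows[][NCOLS], int nrows, const int *key)
{
	int			low = 0,
				high = nrows;
	int			lowcol = 1,
				highcol = 1;	/* prefix known equal at each bound */

	while (low < high)
	{
		int			mid = low + (high - low) / 2;
		int			col = (lowcol < highcol) ? lowcol : highcol;
		int			res = cmp_from(key, rows[mid], &col);

		if (res > 0)
		{
			low = mid + 1;
			lowcol = col;		/* tuples above share this many-1 columns */
		}
		else
		{
			high = mid;
			highcol = col;		/* tuples below share this many-1 columns */
		}
	}
	return low;
}

int
main(void)
{
	int			rows[][NCOLS] = {{1, 1}, {1, 3}, {1, 5}, {2, 2}};
	int			key[NCOLS] = {1, 4};

	printf("first row >= key: %d\n", binsrch(rows, 4, key));	/* 2 */
	return 0;
}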
Attachment: v3-0006-Implement-specialized-uncacheable-attribute-itera.patch
From b33ff609d10ed1ab9925cb128f05d05cc5babb46 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:44:01 +0200
Subject: [PATCH v3 6/9] Implement specialized uncacheable attribute iteration
Uses an iterator to prevent doing duplicate work while iterating over
attributes.
Inspiration: https://www.postgresql.org/message-id/CAEze2WjE9ka8i%3Ds-Vv5oShro9xTrt5VQnQvFG9AaRwWpMm3-fg%40mail.gmail.com
---
src/backend/access/nbtree/nbtree_spec.h | 1 +
src/include/access/itup.h | 179 ++++++++++++++++++++++++
src/include/access/nbtree.h | 12 +-
src/include/access/nbtree_specialize.h | 34 +++++
4 files changed, 224 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree_spec.h b/src/backend/access/nbtree/nbtree_spec.h
index 4c342287f6..88b01c86f7 100644
--- a/src/backend/access/nbtree/nbtree_spec.h
+++ b/src/backend/access/nbtree/nbtree_spec.h
@@ -9,6 +9,7 @@ void
NBTS_FUNCTION(_bt_specialize)(Relation rel)
{
#ifdef NBTS_SPECIALIZING_DEFAULT
+ PopulateTupleDescCacheOffsets(rel->rd_att);
nbts_call_norel(_bt_specialize, rel, rel);
#else
rel->rd_indam->aminsert = NBTS_FUNCTION(btinsert);
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 2c8877e991..cc29614107 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -59,6 +59,15 @@ typedef struct IndexAttributeBitMapData
typedef IndexAttributeBitMapData * IndexAttributeBitMap;
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
/*
* t_info manipulation macros
*/
@@ -126,6 +135,42 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
) \
)
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupdesc - the tuple description
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* Offset of attribute 1 is always 0 */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ false, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false);
+
/*
* MaxIndexTuplesPerPage is an upper bound on the number of tuples that can
* fit on one index page. An index tuple must have either data or a null
@@ -161,4 +206,138 @@ extern IndexTuple CopyIndexTuple(IndexTuple source);
extern IndexTuple index_truncate_tuple(TupleDesc sourceDescriptor,
IndexTuple source, int leavenatts);
+/*
+ * Initiate an index attribute iterator to attribute attnum,
+ * and return the corresponding datum.
+ *
+ * This is nearly the same as index_deform_tuple, except that this
+ * returns the internal state up to attnum, instead of populating the
+ * datum- and isnull-arrays
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+/* ----------------
+ * index_attiternext() - get the next attribute of an index tuple
+ *
+ * This gets called many times, so we do the least amount of work
+ * possible.
+ *
+ * The code does not attempt to update attcacheoff, as it is unlikely
+ * that the cached offset matters much there.
+ * If the cached offsets do matter, the caller should make sure that
+ * PopulateTupleDescCacheOffsets() was called on the tuple descriptor
+ * to populate the attribute offset cache.
+ *
+ * ----------------
+ */
+static inline Datum
+index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true;
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
#endif /* ITUP_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 1559399b0e..80f2575884 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1122,6 +1122,7 @@ typedef struct BTOptions
*/
#define NBTS_TYPE_SINGLE_COLUMN single
#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_DEFAULT default
@@ -1152,12 +1153,19 @@ do { \
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
- IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
) \
: \
( \
- NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ TupleDescAttr(RelationGetDescr(rel), \
+ IndexRelationGetNumberOfKeyAttributes(rel) - 1)->attcacheoff > 0 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_UNCACHED)(__VA_ARGS__) \
+ ) \
) \
)
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index 642bc4c795..52739d390e 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -168,6 +168,40 @@
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but attcacheoff -optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit((itup), (initAttNum), (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/* reset call to SPECIALIZE_CALL for default behaviour */
#undef nbts_call_norel
#define nbts_call_norel(name, rel, ...) \
--
2.30.2
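For clarity, here is a sketch of how a backend-side caller might use the new
attribute iterator from itup.h; walk_index_tuple is a hypothetical helper,
not part of the patch, and assumes the usual backend includes.

/*
 * Sketch: walk the key columns of an index tuple with the new iterator.
 */
#include "postgres.h"

#include "access/itup.h"
#include "utils/rel.h"

void
walk_index_tuple(Relation irel, IndexTuple itup)
{
	TupleDesc	itupdesc = RelationGetDescr(irel);
	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(irel);
	IAttrIterStateData iter;

	/* position the iterator so the first index_attiternext returns attribute 1 */
	index_attiterinit(itup, 1, itupdesc, &iter);

	for (AttrNumber attnum = 1; attnum <= nkeyatts; attnum++)
	{
		Datum		datum = index_attiternext(itup, attnum, itupdesc, &iter);

		if (iter.isNull)
			elog(DEBUG1, "attribute %d is NULL", attnum);
		else
			elog(DEBUG1, "attribute %d has datum " UINT64_FORMAT,
				 attnum, (uint64) datum);
	}
}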
On Sun, 5 Jun 2022 at 21:12, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
While working on benchmarking the v2 patchset, I noticed no
improvement on reindex, which I attributed to forgetting to also
specialize comparetup_index_btree in tuplesort.c. After adding the
specialization there as well (attached in v3), reindex performance
improved significantly too.
PFA version 4 of this patchset. Changes:
- Silence the compiler warnings,
- Extract the itup_attiter code into its own header, so that we don't
get compiler warnings and the cfbot builds pass,
- Re-order the patches into a more logical sequence,
- Update the dynamic prefix compression so that we don't always do a
_bt_compare on the page's high key: memcmp(parentpage_rightsep,
highkey) == 0 is often true, which lets us skip the indirect function
calls in _bt_compare most of the time (a rough sketch of the idea
follows below).
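To illustrate that last item: as I understand the idea, the check is a
byte-wise comparison of the child's high key against the right separator
we followed down from the parent. A hedged sketch, with an illustrative
helper name rather than the patch's actual code:

/*
 * Sketch only: if the leaf high key is byte-identical to the parent's
 * right separator, the insertion key was already compared against that
 * separator during the descent, so the per-attribute comparator calls
 * in _bt_compare() against the high key can be skipped.
 */
static inline bool
highkey_matches_parent_rightsep(IndexTuple highkey,
                                IndexTuple parentpage_rightsep)
{
    Size    hkeysz = IndexTupleSize(highkey);

    return hkeysz == IndexTupleSize(parentpage_rightsep) &&
           memcmp(highkey, parentpage_rightsep, hkeysz) == 0;
}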
Based on local measurements, this patchset improves performance for
all key shapes, with throughput gains ranging from 2% to over 600%
(individual operations 2-86% faster), depending on the key shape. As
can be seen in the attached pgbench output, the results were gathered
on beta1 (f00a4f02, dated 2022-06-04) and are thus not 100% current,
but considering that no significant changes have been made to the
btree AM code since then, I believe these measurements are still valid.
I also didn't re-run the numbers for the master branch, but instead
compared against the master results from my previous mail. I run the
performance tests locally, and a 7-iteration pgbench run for master
requires 9 hours of downtime with this dataset, during which I can't
use the system without interfering with the tests. Given how few
relevant changes have been committed since, rerunning the benchmark
for master didn't seem worth the time, effort and cost.
Kind regards,
Matthias van de Meent.
Attachments:
v4-0001-Specialize-nbtree-functions-on-btree-key-shape.patch
From 595370bc49043fa2413e62c3b4b4573af2ef5f61 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 30 Jan 2022 16:23:31 +0100
Subject: [PATCH v4 1/8] Specialize nbtree functions on btree key shape
nbtree keys are not all made the same, so a significant amount of time is
spent on code that exists only to deal with key shapes other than the one
at hand. By specializing function calls based on the key shape, we can
remove or reduce these sources of overhead.
This commit adds the basic infrastructure for specializing specific hot code
in the nbtree AM to certain shapes of keys, initially splitting out
(not yet specializing) the attcacheoff-capable case.
Note that we generate N specialized functions and 1 'default' function for each
specializable function.
This feature can be disabled by removing the '#define NBTS_ENABLED' line in nbtree.h
---
src/backend/access/nbtree/README | 22 +
src/backend/access/nbtree/nbtdedup.c | 300 +------
src/backend/access/nbtree/nbtdedup_spec.h | 313 +++++++
src/backend/access/nbtree/nbtinsert.c | 572 +-----------
src/backend/access/nbtree/nbtinsert_spec.h | 569 ++++++++++++
src/backend/access/nbtree/nbtpage.c | 4 +-
src/backend/access/nbtree/nbtree.c | 31 +-
src/backend/access/nbtree/nbtree_spec.h | 50 ++
src/backend/access/nbtree/nbtsearch.c | 994 +--------------------
src/backend/access/nbtree/nbtsearch_spec.h | 994 +++++++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 271 +-----
src/backend/access/nbtree/nbtsort_spec.h | 275 ++++++
src/backend/access/nbtree/nbtsplitloc.c | 14 +-
src/backend/access/nbtree/nbtutils.c | 755 +---------------
src/backend/access/nbtree/nbtutils_spec.h | 772 ++++++++++++++++
src/backend/utils/sort/tuplesort.c | 4 +-
src/include/access/nbtree.h | 61 +-
src/include/access/nbtree_specialize.h | 204 +++++
src/include/access/nbtree_specialized.h | 67 ++
19 files changed, 3357 insertions(+), 2915 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.h
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.h
create mode 100644 src/backend/access/nbtree/nbtree_spec.h
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.h
create mode 100644 src/backend/access/nbtree/nbtsort_spec.h
create mode 100644 src/backend/access/nbtree/nbtutils_spec.h
create mode 100644 src/include/access/nbtree_specialize.h
create mode 100644 src/include/access/nbtree_specialized.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 5529afc1fe..3c08888c23 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1041,6 +1041,28 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+
+Notes about nbtree call specialization
+--------------------------------------
+
+Attribute iteration is a significant overhead for multi-column indexes.
+We can avoid it by specializing performance-sensitive search functions
+and calling those selectively. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths. This performance benefit is at the cost
+of binary size, so this feature can be disabled by defining NBTS_DISABLED.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - single-column indexes
+ NB: The code paths of this optimization do not support multiple key columns.
+ - multi-column indexes that could benefit from the attcacheoff optimization
+ NB: This is also used for the default case, and is slow for uncachable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 0207421a5d..d7025d8e1c 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,259 +22,16 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple, and actually update the page. Else
- * reset the state and move on without modifying the page.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
/*
* Perform bottom-up index deletion pass.
@@ -373,7 +130,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
- else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ else if (nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
@@ -748,55 +505,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
diff --git a/src/backend/access/nbtree/nbtdedup_spec.h b/src/backend/access/nbtree/nbtdedup_spec.h
new file mode 100644
index 0000000000..27e5a7e686
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.h
@@ -0,0 +1,313 @@
+/*
+ * Specialized functions included in nbtdedup.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static bool NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
+
+#endif /* ifndef NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = nbts_call(_bt_do_singleval, rel, page, state,
+ minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and actually update the page. Else
+ * reset the state and move on without modifying the page.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f6f4af8bfe..ec6c73d1cc 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,18 +30,13 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
-static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_stepright(Relation rel,
+ BTInsertState insertstate,
+ BTStack stack);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -73,311 +68,10 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
- NULL);
-}
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -438,7 +132,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = nbts_call(_bt_binsrch_insert, rel, insertstate);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -483,7 +177,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(nbts_call(_bt_compare, rel, itup_key, page, offset) < 0);
break;
}
@@ -508,7 +202,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (nbts_call(_bt_compare, rel, itup_key, page, offset) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -722,7 +416,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -769,246 +463,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
-
- newitemoff = _bt_binsrch_insert(rel, insertstate);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1649,7 +1103,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lastleft = nposting;
}
- lefthighkey = _bt_truncate(rel, lastleft, firstright, itup_key);
+ lefthighkey = nbts_call(_bt_truncate, rel, lastleft, firstright, itup_key);
itemsz = IndexTupleSize(lefthighkey);
}
else
@@ -2764,8 +2218,8 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
/* Perform deduplication pass (when enabled and index-is-allequalimage) */
if (BTGetDeduplicateItems(rel) && itup_key->allequalimage)
- _bt_dedup_pass(rel, buffer, heapRel, insertstate->itup,
- insertstate->itemsz, (indexUnchanged || uniquedup));
+ nbts_call(_bt_dedup_pass, rel, buffer, heapRel, insertstate->itup,
+ insertstate->itemsz, (indexUnchanged || uniquedup));
}
/*
diff --git a/src/backend/access/nbtree/nbtinsert_spec.h b/src/backend/access/nbtree/nbtinsert_spec.h
new file mode 100644
index 0000000000..97c866aea3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.h
@@ -0,0 +1,569 @@
+/*
+ * Specialized functions for nbtinsert.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static BTStack NBTS_FUNCTION(_bt_search_insert)(Relation rel,
+ BTInsertState insertstate);
+
+static OffsetNumber NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ nbts_call(_bt_compare, rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return nbts_call(_bt_search, rel, insertstate->itup_key,
+ &insertstate->buf, BT_WRITE, NULL);
+}
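+
+/*
+ * Note that the function bodies in this file are intended to be unchanged
+ * copies of their nbtinsert.c originals: the only difference should be that
+ * direct calls to specializable routines (_bt_search, _bt_compare,
+ * _bt_binsrch_insert, _bt_mkscankey) now go through nbts_call(), so that
+ * each key-shape variant of this code calls the matching variant of those
+ * routines.
+ */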
+
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0)
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ Assert(P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0);
+
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
+
+#endif /* ifndef NBTS_SPECIALIZING_DEFAULT */
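+
+/*
+ * Unlike the static helpers above, _bt_doinsert() is placed outside the
+ * NBTS_SPECIALIZING_DEFAULT guard: it is exposed beyond this file, so a
+ * variant of it is emitted for the default shape as well, and the
+ * specialized btinsert() reaches it through nbts_call().
+ */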
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = nbts_call(_bt_mkscankey, rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = nbts_call(_bt_search_insert, rel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = nbts_call(_bt_findinsertloc, rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+ itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 20adb602a4..e66299ebd8 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1967,10 +1967,10 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, targetkey);
+ itup_key = nbts_call(_bt_mkscankey, rel, targetkey);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
- stack = _bt_search(rel, itup_key, &sleafbuf, BT_READ, NULL);
+ stack = nbts_call(_bt_search, rel, itup_key, &sleafbuf, BT_READ, NULL);
/* won't need a second lock or pin on leafbuf */
_bt_relbuf(rel, sleafbuf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9b730f303f..c9cd2b6026 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,10 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
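+
+/*
+ * Including nbtree_specialize.h with NBT_SPECIALIZE_FILE defined is what
+ * emits the per-key-shape function variants: that header is expected to
+ * re-include the named file once per supported key shape (plus once for the
+ * default), redefining NBTS_FUNCTION() and the nbts_call() dispatch macros
+ * on each pass.
+ */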
+
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -177,33 +181,6 @@ btbuildempty(Relation index)
smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
diff --git a/src/backend/access/nbtree/nbtree_spec.h b/src/backend/access/nbtree/nbtree_spec.h
new file mode 100644
index 0000000000..4c342287f6
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.h
@@ -0,0 +1,50 @@
+/*
+ * Specialized functions for nbtree.c
+ */
+
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
+void
+NBTS_FUNCTION(_bt_specialize)(Relation rel)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbts_call_norel(_bt_specialize, rel, rel);
+#else
+ rel->rd_indam->aminsert = NBTS_FUNCTION(btinsert);
+#endif
+}
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbts_call_norel(_bt_specialize, rel, rel);
+
+ return nbts_call(btinsert, rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexUnchanged, indexInfo);
+#else
+ bool result;
+ IndexTuple itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = nbts_call(_bt_doinsert, rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+#endif
+}
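+
+/*
+ * Taken together, the two functions above suggest the following insertion
+ * flow (assuming rd_indam->aminsert initially points at the default
+ * btinsert() variant):
+ *
+ * 1. The first insert runs the default btinsert(), which calls
+ *    _bt_specialize() to install the key-shape-specific btinsert() into
+ *    rd_indam->aminsert, then forwards the call through nbts_call().
+ * 2. Later inserts call the specialized btinsert() directly; it forms the
+ *    index tuple and calls _bt_doinsert() through nbts_call(), which in a
+ *    specialized build presumably resolves straight to the same-shape
+ *    variant.
+ */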
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c74543bfde..e81eee9c35 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,11 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -46,6 +43,9 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* _bt_drop_lock_and_maybe_pin()
@@ -70,493 +70,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- */
-BTStack
-_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
- Snapshot snapshot)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- */
-Buffer
-_bt_moveright(Relation rel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- Snapshot snapshot)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- page = BufferGetPage(buf);
- TestForOldSnapshot(snapshot, rel, page);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- break;
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
- {
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- break;
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- high = mid;
- }
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- {
- high = mid;
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
/*----------
* _bt_binsrch_posting() -- posting list binary search.
@@ -625,217 +138,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- return result;
-
- scankey++;
- }
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
-
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -1363,7 +665,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
- stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
+ stack = nbts_call(_bt_search, rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
@@ -1392,7 +694,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = nbts_call(_bt_binsrch, rel, &inskey, buf);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1422,9 +724,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, offnum))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, offnum))
{
/*
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
@@ -1498,280 +800,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2014,7 +1042,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation,
+ scan, dir, P_FIRSTDATAKEY(opaque)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2116,7 +1145,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation, scan,
+ dir, PageGetMaxOffsetNumber(page)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2448,7 +1478,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, start))
{
/*
* There's no actually-matching data on this page. Try to advance to
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
new file mode 100644
index 0000000000..73d5370496
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -0,0 +1,994 @@
+/*
+ * Specialized functions for nbtsearch.c
+ */
+
+/*
+ * These functions are not exposed outside nbtsearch.c (they are static), so
+ * the "default" variants emitted for them would be unused and would trigger
+ * compiler warnings.  To avoid generating that dead code and the warnings,
+ * they are only emitted for the non-default specializations.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel, BTScanInsert key,
+ Buffer buf);
+static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_binsrch)(Relation rel,
+ BTScanInsert key,
+ Buffer buf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
+
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+	/* allow next page to be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = nbts_call(_bt_checkkeys, scan->indexRelation,
+ scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+#endif /* NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ */
+BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
+ int access, Snapshot snapshot)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP,
+ (access == BT_WRITE), stack_in,
+ page_access, snapshot);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = nbts_call(_bt_binsrch, rel, key, *bufP);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is lowest non-leaf page level prior to leaves. So, if
+ * we're on the level 1 and asked to lock leaf page in write mode,
+ * then lock next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP, true, stack_in,
+ BT_WRITE, snapshot);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split
+ * is completed. 'stack' is only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ */
+Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ Snapshot snapshot)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ break;
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY) >= cmpval)
+ {
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ break;
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
+
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ {
+ high = mid;
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
+
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+NBTS_FUNCTION(_bt_compare)(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+ scankey = key->scankeys;
+ for (int i = 1; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ return result;
+
+ scankey++;
+ }
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
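
(Aside, for anyone reading the diff top to bottom: the nbts_call()/nbts_call_norel()
and NBTS_FUNCTION() macros used throughout nbtsearch_spec.h come from
access/nbtree_specialize.h, which is introduced earlier in the series and not
repeated in this patch file. Purely as a sketch of the idea -- the shape names
below are made up and this is not the header's actual contents -- the macros
behave roughly like this:

    /*
     * Sketch only: NBTS_FUNCTION() pastes a per-shape suffix onto a function
     * name, and nbts_call*() dispatches to the variant matching the index's
     * key shape.  The real header supports more shapes and drives dispatch
     * differently; this just illustrates the mechanism.
     */
    #define NBTS_MAKE_NAME_(name, shape)  name##_##shape
    #define NBTS_MAKE_NAME(name, shape)   NBTS_MAKE_NAME_(name, shape)
    #define NBTS_FUNCTION(name)           NBTS_MAKE_NAME(name, NBTS_SPECIALIZE_NAME)

    /* nbts_call_norel() only uses the relation to pick a variant ... */
    #define nbts_call_norel(name, rel, ...) \
        (IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? \
         NBTS_MAKE_NAME(name, single_key)(__VA_ARGS__) : \
         NBTS_MAKE_NAME(name, multi_key)(__VA_ARGS__))
    /* ... while nbts_call() also passes it on as the first argument */
    #define nbts_call(name, rel, ...) \
        nbts_call_norel(name, (rel), (rel), __VA_ARGS__)

So a call such as nbts_call(_bt_compare, rel, key, page, mid) ends up in the
_bt_compare variant generated for rel's key shape.)
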
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 9f60fa9894..f1d146ba71 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,9 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* btbuild() -- build a new btree index.
@@ -566,7 +567,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+ wstate.inskey = nbts_call(_bt_mkscankey, wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
@@ -578,7 +579,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
PROGRESS_BTREE_PHASE_LEAF_LOAD);
- _bt_load(&wstate, btspool, btspool2);
+ nbts_call_norel(_bt_load, wstate.index, &wstate, btspool, btspool2);
}
/*
@@ -978,8 +979,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
lastleft = (IndexTuple) PageGetItem(opage, ii);
Assert(IndexTupleSize(oitup) > last_truncextra);
- truncated = _bt_truncate(wstate->index, lastleft, oitup,
- wstate->inskey);
+ truncated = nbts_call(_bt_truncate, wstate->index, lastleft, oitup,
+ wstate->inskey);
if (!PageIndexTupleOverwrite(opage, P_HIKEY, (Item) truncated,
IndexTupleSize(truncated)))
elog(ERROR, "failed to add high key to the index page");
@@ -1176,264 +1177,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
new file mode 100644
index 0000000000..8f4a3602ca
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -0,0 +1,275 @@
+/*
+ * Specialized functions included in nbtsort.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static void NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2);
+
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (nbts_call(_bt_keep_natts_fast, wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
+
+#endif
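
(Similarly, the #define NBT_SPECIALIZE_FILE / #include "access/nbtree_specialize.h"
dance seen above in nbtsort.c and nbtutils.c is what stamps out one copy of every
*_spec.h function per key shape. Again only a rough sketch with made-up shape
names, not the header's real contents:

    /*
     * Sketch of the multiple-inclusion pattern: each pass re-includes the
     * caller's NBT_SPECIALIZE_FILE under a different NBTS_SPECIALIZE_NAME,
     * emitting one specialized body per shape.  The "default" pass defines
     * NBTS_SPECIALIZING_DEFAULT, so static helpers guarded by
     * #ifndef NBTS_SPECIALIZING_DEFAULT are skipped there and don't produce
     * unused-function warnings.
     */
    #define NBTS_SPECIALIZE_NAME single_key
    #include NBT_SPECIALIZE_FILE
    #undef NBTS_SPECIALIZE_NAME

    #define NBTS_SPECIALIZE_NAME multi_key
    #include NBT_SPECIALIZE_FILE
    #undef NBTS_SPECIALIZE_NAME

    #define NBTS_SPECIALIZING_DEFAULT
    #define NBTS_SPECIALIZE_NAME default
    #include NBT_SPECIALIZE_FILE
    #undef NBTS_SPECIALIZE_NAME
    #undef NBTS_SPECIALIZING_DEFAULT

That is also why the new *_spec.h files wrap their static functions in
#ifndef NBTS_SPECIALIZING_DEFAULT, as noted in their header comments.)
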
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 241e26d338..8e5337cad7 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -692,7 +692,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
{
itemid = PageGetItemId(state->origpage, maxoff);
tup = (IndexTuple) PageGetItem(state->origpage, itemid);
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -723,7 +723,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -972,7 +972,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* avoid appending a heap TID in new high key, we're done. Finish split
* with default strategy and initial split interval.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
return perfectpenalty;
@@ -993,7 +993,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* If page is entirely full of duplicates, a single value strategy split
* will be performed.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
{
*strategy = SPLIT_MANY_DUPLICATES;
@@ -1031,8 +1031,8 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
itemid = PageGetItemId(state->origpage, P_HIKEY);
hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
- perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
- state->newitem);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, hikey,
+ state->newitem);
if (perfectpenalty <= indnkeyatts)
*strategy = SPLIT_SINGLE_VALUE;
else
@@ -1154,7 +1154,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
lastleft = _bt_split_lastleft(state, split);
firstright = _bt_split_firstright(state, split);
- return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+ return nbts_call(_bt_keep_natts_fast, state->rel, lastleft, firstright);
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index ff260c393a..bc443ebd27 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,11 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1221,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2173,286 +1704,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
new file mode 100644
index 0000000000..a4b934ae7a
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -0,0 +1,772 @@
+/*
+ * Specialized functions included in nbtutils.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static bool NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+
+static int NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey, IndexTuple tuple,
+ int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == nbts_call(_bt_keep_natts_fast, rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+#endif /* NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (nbts_call_norel(_bt_check_rowcompare, rel, key, tuple,
+ tupnatts, tupdesc, dir, continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = nbts_call(_bt_keep_natts, rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
+ IndexTuple lastleft,
+ IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 31554fd867..27a5d53324 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1153,7 +1153,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = nbts_call(_bt_mkscankey, indexRel, NULL);
if (state->indexInfo->ii_Expressions != NULL)
{
@@ -1251,7 +1251,7 @@ tuplesort_begin_index_btree(Relation heapRel,
state->enforceUnique = enforceUnique;
state->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = nbts_call(_bt_mkscankey, indexRel, NULL);
/* Prepare SortSupport data for each column */
state->sortKeys = (SortSupport) palloc0(state->nKeys *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 93f8267b48..83e0dbab16 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1116,15 +1116,47 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+
+
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
+
+#define NBTS_ENABLED
+
+#ifdef NBTS_ENABLED
+
+/*
+ * Access a specialized nbtree function, based on the shape of the index key.
+ */
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) \
+( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+)
+
+#else /* not defined NBTS_ENABLED */
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
+
+#endif /* NBTS_ENABLED */
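+
+/*
+ * Illustrative example: with NBTS_ENABLED defined, a call such as
+ *     NBT_SPECIALIZE_CALL(_bt_mkscankey, indexRel, indexRel, NULL)
+ * expands to
+ *     _bt_mkscankey_cached(indexRel, NULL)
+ * whereas with NBTS_ENABLED undefined it expands to the unspecialized
+ *     _bt_mkscankey(indexRel, NULL)
+ * Call sites normally reach this macro through the nbts_call() and
+ * nbts_call_norel() wrappers defined in nbtree_specialize.h (included below).
+ */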
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specialized.h"
+#include "nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1155,9 +1187,6 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
- IndexTuple newitem, Size newitemsz,
- bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1173,9 +1202,6 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
@@ -1223,12 +1249,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
- int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -1237,7 +1257,6 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1245,8 +1264,6 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1259,10 +1276,6 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
new file mode 100644
index 0000000000..23fdda4f0e
--- /dev/null
+++ b/src/include/access/nbtree_specialize.h
@@ -0,0 +1,204 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_specialize.h
+ * header file for the key-shape specialization of the postgres btree access method.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_specialize.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_call(function, indexrel, ...rest_of_args), and
+ * nbts_call_norel(function, indexrel, ...args)
+ * This will call the specialized variant of 'function' based on the index
+ * relation data.
+ * The difference between nbts_call and nbts_call_norel is that nbts_call
+ * passes indexrel as the first argument of the function call, whereas
+ * nbts_call_norel does not.
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ *
+ * example usage:
+ *
+ * kwithnulls = nbts_call_norel(_bt_key_hasnulls, myindex, mytuple, tupDesc);
+ *
+ * bool NBTS_FUNCTION(_bt_key_hasnulls)(IndexTuple mytuple, TupleDesc tupDesc)
+ * {
+ * nbts_attiterdeclare(mytuple);
+ * nbts_attiterinit(mytuple, 1, tupDesc);
+ * nbts_foreachattr(1, 10)
+ * {
+ * Datum it = nbts_attiter_nextattdatum(mytuple, tupDesc);
+ * if (nbts_attiter_curattisnull(mytuple))
+ * return true;
+ * }
+ * return false;
+ * }
+ */
+
+/*
+ * Call a potentially specialized function for a given btree operation.
+ *
+ * NB: the rel argument is evaluated multiple times.
+ */
+#define nbts_call(name, rel, ...) \
+ nbts_call_norel(name, (rel), (rel), __VA_ARGS__)
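+
+/*
+ * For example (illustrative only), nbts_call(_bt_mkscankey, rel, NULL)
+ * expands to nbts_call_norel(_bt_mkscankey, rel, rel, NULL), which in turn
+ * resolves to whichever variant nbts_call_norel currently selects: a
+ * specific specialization while specialized code is being emitted, or
+ * NBT_SPECIALIZE_CALL() elsewhere.
+ */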
+
+#ifdef NBTS_ENABLED
+
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+#ifdef nbts_call_norel
+#undef nbts_call_norel
+#endif
+
+#define nbts_call_norel(name, rel, ...) \
+ (NBTS_FUNCTION(name)(__VA_ARGS__))
+
+/*
+ * Multiple key columns, optimized access for attcacheoff -cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
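+/*
+ * Emit the "cached" specializations: each NBTS_FUNCTION(name) declared or
+ * defined in NBT_SPECIALIZE_FILE is generated here as name_cached, using
+ * the iterator macros defined just above.
+ */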
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_CACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* reset nbts_call_norel to route through NBT_SPECIALIZE_CALL for default behaviour */
+#undef nbts_call_norel
+#define nbts_call_norel(name, rel, ...) \
+ NBT_SPECIALIZE_CALL(name, (rel), __VA_ARGS__)
+
+/*
+ * "Default", externally accessible, not so much optimized functions
+ */
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+/* for the default functions, we want to use the unspecialized name. */
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) name
+
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* from here on there are no more NBTS_FUNCTIONs */
+#undef NBTS_FUNCTION
+
+#else /* not defined NBTS_ENABLED */
+
+/*
+ * NBTS_ENABLED is not defined, so we don't want to use the specializations.
+ * We revert to the behaviour from PG14 and earlier, which only uses
+ * attcacheoff.
+ */
+
+#define NBTS_FUNCTION(name) name
+
+#define nbts_call_norel(name, rel, ...) \
+ name(__VA_ARGS__)
+
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+
+#endif /* !NBTS_ENABLED */
diff --git a/src/include/access/nbtree_specialized.h b/src/include/access/nbtree_specialized.h
new file mode 100644
index 0000000000..c45fa84aed
--- /dev/null
+++ b/src/include/access/nbtree_specialized.h
@@ -0,0 +1,67 @@
+/*
+ * prototypes for functions that are included in nbtree.h
+ */
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_specialize)(Relation rel);
+
+extern bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key,
+ Buffer *bufP, int access,
+ Snapshot snapshot);
+extern Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel, BTScanInsert key, Buffer buf,
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot);
+extern OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate);
+extern int32
+NBTS_FUNCTION(_bt_compare)(Relation rel, BTScanInsert key,
+ Page page, OffsetNumber offnum);
+
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup);
+extern bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
--
2.30.2
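For review purposes, it may help to see what the macro layer above boils down to. Under the "cached" shape the iteration macros are thin wrappers around index_getattr, so a specialized loop over the key columns expands to roughly the sketch below (illustrative only, with itup, itupdesc and ncmpkey standing in for the locals of the surrounding specialized function; this code is not part of the patches):

    bool    itup_isNull;            /* nbts_attiterdeclare(itup) */

    /* nbts_attiterinit() is a no-op for the cached shape */
    for (int spec_i = 1; spec_i <= ncmpkey; spec_i++)   /* nbts_foreachattr */
    {
        Datum   datum;

        /*
         * nbts_attiter_nextattdatum(): a plain index_getattr() call, which
         * relies on attcacheoff for O(1) access to each attribute.
         */
        datum = index_getattr(itup, spec_i, itupdesc, &itup_isNull);

        if (itup_isNull)            /* nbts_attiter_curattisnull(itup) */
        {
            /* ... NULL handling of the enclosing function ... */
        }

        /* ... comparison logic of the enclosing function ... */
    }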
v4-0002-Use-specialized-attribute-iterators-in-backend-nb.patchapplication/octet-stream; name=v4-0002-Use-specialized-attribute-iterators-in-backend-nb.patchDownload
From ef93c39b89e9db22b85b2893caf8fb97dc2651c2 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:30:00 +0200
Subject: [PATCH v4 2/8] Use specialized attribute iterators in
backend/*/nbt*_spec.h
Split out to make it clear what substantial changes were made to the
pre-existing functions.
Even though not all nbt*_spec functions have been updated, most call sites
can now call the specialized functions directly instead of having to determine
the right specialization based on the (potentially locally unavailable) index
relation, making the specialization of those functions worth the effort.
---
src/backend/access/nbtree/nbtsearch_spec.h | 16 +++---
src/backend/access/nbtree/nbtsort_spec.h | 24 +++++----
src/backend/access/nbtree/nbtutils_spec.h | 63 +++++++++++++---------
3 files changed, 62 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
index 73d5370496..a5c5f2b94f 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.h
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -823,6 +823,7 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -854,23 +855,26 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
index 8f4a3602ca..d3f2db2dc4 100644
--- a/src/backend/access/nbtree/nbtsort_spec.h
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -27,8 +27,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -50,7 +49,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -82,22 +81,25 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
}
else if (itup != NULL)
{
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
int32 compare = 0;
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
index a4b934ae7a..638eff18f6 100644
--- a/src/backend/access/nbtree/nbtutils_spec.h
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -211,6 +211,8 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
@@ -222,20 +224,22 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -243,6 +247,7 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -295,7 +300,7 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
- int i;
+ nbts_attiterdeclare(itup);
itupdesc = RelationGetDescr(rel);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -326,7 +331,10 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -337,27 +345,30 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -744,24 +755,28 @@ NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft,itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) !=
+ nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
--
2.30.2
v4-0003-Specialize-the-nbtree-rd_indam-entry.patchapplication/octet-stream; name=v4-0003-Specialize-the-nbtree-rd_indam-entry.patchDownload
From 6b8384a22b1659bea4003c478b454960115ec025 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:54:52 +0200
Subject: [PATCH v4 3/8] Specialize the nbtree rd_indam entry.
Because each rd_indam struct is separately allocated for each index, we can
freely modify it at runtime without impacting other indexes of the same
access method. For btinsert (which effectively only calls _bt_insert) it is
useful to specialize that function, which also makes rd_indam->aminsert a
good signal for whether or not the indexRelation has been fully optimized yet.
---
src/backend/access/nbtree/nbtree.c | 7 +++++++
src/backend/access/nbtree/nbtsearch.c | 2 ++
src/backend/access/nbtree/nbtsort.c | 2 ++
src/include/access/nbtree.h | 14 ++++++++++++++
4 files changed, 25 insertions(+)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9cd2b6026..5f4271f718 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -160,6 +160,8 @@ btbuildempty(Relation index)
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+ nbt_opt_specialize(index);
+
/*
* Write the page and log it. It might seem that an immediate sync would
* be sufficient to guarantee that the file exists on disk, but recovery
@@ -322,6 +324,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -764,6 +768,7 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(info->index);
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
@@ -797,6 +802,8 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (info->analyze_only)
return stats;
+ nbt_opt_specialize(info->index);
+
/*
* If btbulkdelete was called, we need not do anything (we just maintain
* the information used within _bt_vacuum_needs_cleanup() by calling
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e81eee9c35..d5152bfcb7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -181,6 +181,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(!BTScanPosIsValid(so->currPos));
+ nbt_opt_specialize(scan->indexRelation);
+
pgstat_count_index_scan(rel);
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f1d146ba71..22c7163197 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -305,6 +305,8 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
BTBuildState buildstate;
double reltuples;
+ nbt_opt_specialize(index);
+
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
ResetUsage();
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0dbab16..489b623663 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1132,6 +1132,19 @@ typedef struct BTOptions
#ifdef NBTS_ENABLED
+/*
+ * Replace the functions in the rd_indam struct with a variant optimized for
+ * our key shape, if not already done.
+ *
+ * It only needs to be done once for every index relation loaded, so it's
+ * quite unlikely that we need to do this, which is why the check is marked unlikely().
+ */
+#define nbt_opt_specialize(rel) \
+do { \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert)) \
+ _bt_specialize(rel); \
+} while (false)
+
/*
* Access a specialized nbtree function, based on the shape of the index key.
*/
@@ -1143,6 +1156,7 @@ typedef struct BTOptions
#else /* not defined NBTS_ENABLED */
+#define nbt_opt_specialize(rel)
#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
#endif /* NBTS_ENABLED */
--
2.30.2
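To clarify the intent of the rd_indam swap: the nbt_opt_specialize() check above is designed to fire exactly once per loaded index relation, after which aminsert no longer points at the generic btinsert. Expanded by hand, the mechanism amounts to the sketch below (illustrative only; "cached" stands in for whichever shape the dispatch selects, and btinsert_cached is a hypothetical generated name):

    /* Lazily specialize this index relation, at most once. */
    if (unlikely(rel->rd_indam->aminsert == btinsert))
        _bt_specialize(rel);

    /*
     * The unspecialized _bt_specialize() dispatches on the key shape and
     * installs the matching specialized btinsert variant, conceptually:
     */
    void
    _bt_specialize_cached(Relation rel)     /* hypothetical generated name */
    {
        rel->rd_indam->aminsert = btinsert_cached;
    }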
v4-0004-Optimize-attribute-iterator-access-for-single-col.patchapplication/octet-stream; name=v4-0004-Optimize-attribute-iterator-access-for-single-col.patchDownload
From 1cceed69b9746c15155bbaa81a84c90f90ccca3f Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:47:50 +0200
Subject: [PATCH v4 4/8] Optimize attribute iterator access for single-column
btree keys
This removes the index_getattr_nocache call path, which has significant overhead.
---
src/include/access/nbtree.h | 9 +++-
src/include/access/nbtree_specialize.h | 63 ++++++++++++++++++++++++++
2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 489b623663..1559399b0e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1120,6 +1120,7 @@ typedef struct BTOptions
/*
* Macros used in the nbtree specialization code.
*/
+#define NBTS_TYPE_SINGLE_COLUMN single
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
@@ -1151,7 +1152,13 @@ do { \
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
- NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
)
#else /* not defined NBTS_ENABLED */
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index 23fdda4f0e..9733a27bdd 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -79,6 +79,69 @@
#define nbts_call_norel(name, rel, ...) \
(NBTS_FUNCTION(name)(__VA_ARGS__))
+/*
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path must never be used for indexes with multiple key
+ * columns, because it never continues to the next column.
+ */
+
+#define NBTS_SPECIALIZING_SINGLE_COLUMN
+#define NBTS_TYPE NBTS_TYPE_SINGLE_COLUMN
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+/*
+ * We cast endAttNum to void to prevent unused-variable warnings.
+ * The if statement plus single-iteration for-loop is structured this way
+ * so that the compiler can unroll the loop after seeing that it runs
+ * exactly once, while still allowing `break` inside the code block that
+ * follows; a plain 'if' statement would not support `break`.
+ */
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert((endAttNum) == 1); ((void) (endAttNum)); \
+ if ((initAttNum) == 1) for (int spec_i = 0; spec_i < 1; spec_i++)
+
+#define nbts_attiter_attnum 1
+
+/*
+ * Simplified (optimized) variant of index_getattr specialized for extracting
+ * only the first attribute: its offset is always 0, so no cached offset
+ * is required.
+ */
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+( \
+ AssertMacro(spec_i == 0), \
+ (IndexTupleHasNulls(itup) && att_isnull(0, (char *)(itup) + sizeof(IndexTupleData))) ? \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull)) = true, \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull) = false), \
+ (Datum) fetchatt(TupleDescAttr((tupDesc), 0), \
+ (char *) (itup) + IndexInfoFindDataOffset((itup)->t_info)) \
+ ) \
+)
+
+#define nbts_attiter_curattisnull(tuple) \
+ NBTS_MAKE_NAME(tuple, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_COLUMN
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
+ * Multiple key columns, optimized access for attcacheoff-cacheable offsets.
*/
--
2.30.2
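Written out without the macro layer, the single-column fast path amounts to the helper sketched below (illustrative only; the patch generates the equivalent code through nbts_attiter_nextattdatum rather than through a named function):

    /*
     * Fetch the first (and only) key attribute of an index tuple. Its
     * offset is always 0, so neither attcacheoff nor the slow
     * index_getattr_nocache path is needed.
     */
    static inline Datum
    first_key_attdatum(IndexTuple itup, TupleDesc tupdesc, bool *isnull)
    {
        /* the NULL bitmap, if any, starts right after the fixed header */
        if (IndexTupleHasNulls(itup) &&
            att_isnull(0, (char *) itup + sizeof(IndexTupleData)))
        {
            *isnull = true;
            return (Datum) 0;
        }

        *isnull = false;
        /* attribute 1 starts at offset 0 within the tuple's data area */
        return fetchatt(TupleDescAttr(tupdesc, 0),
                        (char *) itup + IndexInfoFindDataOffset(itup->t_info));
    }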
v4-0005-Add-a-function-whose-task-it-is-to-populate-all-a.patchapplication/octet-stream; name=v4-0005-Add-a-function-whose-task-it-is-to-populate-all-a.patchDownload
From a3ff1bc255ec747e0e469233612e0505b4cc1a76 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:51:01 +0200
Subject: [PATCH v4 5/8] Add a function whose task it is to populate all
attcacheoff-s of a TupleDesc's attributes
It fills uncacheable offsets with -2 (as opposed to -1, which signals
"unknown"), allowing users of the API to determine the cacheability
of an attribute in O(1) after this one-time O(n) cost, instead of
the repeated O(n) cost that currently applies.
---
src/backend/access/common/tupdesc.c | 97 +++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 99 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 9f41b1e854..5630fc9da0 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -910,3 +910,100 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc, returning the last
+ * attcacheoff with a valid value.
+ *
+ * Sets attcacheoff to -2 for uncacheable attributes (i.e. attributes that
+ * follow a variable-length attribute).
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber i, j;
+
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ {
+ /*
+ * Already done the calculations, find the last attribute that has
+ * cache offset.
+ */
+ for (i = (AttrNumber) numberOfAttributes; i > 1; i--)
+ {
+ if (TupleDescAttr(desc, i - 1)->attcacheoff != -2)
+ return i;
+ }
+
+ return 1;
+ }
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ i = 1;
+ /*
+ * Someone might have set some offsets previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (i < numberOfAttributes && TupleDescAttr(desc, i)->attcacheoff > 0)
+ i++;
+
+ /* Cache offset is undetermined. Start calculating offsets if possible */
+ if (i < numberOfAttributes &&
+ TupleDescAttr(desc, i)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, i - 1);
+ Size off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (i < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, i);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ i++;
+ }
+ } else {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ }
+ }
+
+ /*
+ * No cacheable offsets left. Fill the rest with -2s, but return the last
+ * cacheable attribute.
+ */
+ j = i;
+
+ while (i < numberOfAttributes)
+ {
+ TupleDescAttr(desc, i)->attcacheoff = -2;
+ i++;
+ }
+
+ return j;
+}
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index 28dd6de18b..219f837875 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.30.2
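As a worked example of the -1/-2 convention (a sketch with a hypothetical index keyed on (int4, text, int4); rel and natt are stand-ins for a caller's locals):

    /*
     * After one O(n) pass of PopulateTupleDescCacheOffsets(), the resulting
     * attcacheoff values for (int4, text, int4) would be:
     *   attr 1:  0    first attribute always starts at offset 0
     *   attr 2:  4    preceded only by a fixed-length int4, so cacheable
     *   attr 3: -2    follows a variable-length text, never cacheable
     */
    AttrNumber  lastcached = PopulateTupleDescCacheOffsets(RelationGetDescr(rel));

    /*
     * From here on, "is attribute natt cacheable?" is an O(1) test:
     * attcacheoff >= 0 means the offset is known, -2 means it never will be.
     */
    if (TupleDescAttr(RelationGetDescr(rel), natt - 1)->attcacheoff >= 0)
    {
        /* ... take the attcacheoff fast path for attribute natt ... */
    }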
v4-0006-Implement-specialized-uncacheable-attribute-itera.patchapplication/octet-stream; name=v4-0006-Implement-specialized-uncacheable-attribute-itera.patchDownload
From f8ea88c8da8fa1bd43e678589992455116cd82ef Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:44:01 +0200
Subject: [PATCH v4 6/8] Implement specialized uncacheable attribute iteration
Uses an iterator to prevent doing duplicate work while iterating over
attributes.
Inspiration: https://www.postgresql.org/message-id/CAEze2WjE9ka8i%3Ds-Vv5oShro9xTrt5VQnQvFG9AaRwWpMm3-fg%40mail.gmail.com
---
src/backend/access/nbtree/nbtree_spec.h | 1 +
src/include/access/itup_attiter.h | 198 ++++++++++++++++++++++++
src/include/access/nbtree.h | 13 +-
src/include/access/nbtree_specialize.h | 34 ++++
4 files changed, 244 insertions(+), 2 deletions(-)
create mode 100644 src/include/access/itup_attiter.h
diff --git a/src/backend/access/nbtree/nbtree_spec.h b/src/backend/access/nbtree/nbtree_spec.h
index 4c342287f6..88b01c86f7 100644
--- a/src/backend/access/nbtree/nbtree_spec.h
+++ b/src/backend/access/nbtree/nbtree_spec.h
@@ -9,6 +9,7 @@ void
NBTS_FUNCTION(_bt_specialize)(Relation rel)
{
#ifdef NBTS_SPECIALIZING_DEFAULT
+ PopulateTupleDescCacheOffsets(rel->rd_att);
nbts_call_norel(_bt_specialize, rel, rel);
#else
rel->rd_indam->aminsert = NBTS_FUNCTION(btinsert);
diff --git a/src/include/access/itup_attiter.h b/src/include/access/itup_attiter.h
new file mode 100644
index 0000000000..9f16a4b3d7
--- /dev/null
+++ b/src/include/access/itup_attiter.h
@@ -0,0 +1,198 @@
+/*-------------------------------------------------------------------------
+ *
+ * itup_attiter.h
+ * POSTGRES index tuple attribute iterator definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/itup_attiter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ITUP_ATTITER_H
+#define ITUP_ATTITER_H
+
+#include "access/itup.h"
+
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupleDesc - the tuple descriptor
+ * iter - pointer to the iterator state to initialize
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* Offset of attribute 1 is always 0 */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ false, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false)
+
+/*
+ * Initialize an index attribute iterator so that iteration can continue
+ * from attribute attnum onwards.
+ *
+ * This is nearly the same as index_deform_tuple, except that it stores
+ * the internal state up to attnum in the iterator, instead of populating
+ * the datum- and isnull-arrays.
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+/* ----------------
+ * index_attiternext() - get the next attribute of an index tuple
+ *
+ * This gets called many times, so we do the least amount of work
+ * possible.
+ *
+ * The code does not attempt to update attcacheoff, as it is unlikely
+ * to reach a situation where the cached offset matters a lot.
+ * If the cached offsets do matter, the caller should make sure that
+ * PopulateTupleDescCacheOffsets() has been called on the tuple descriptor
+ * to populate the attribute offset cache.
+ *
+ * ----------------
+ */
+static inline Datum
+index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true;
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
+#endif /* ITUP_ATTITER_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 1559399b0e..92894e4ea7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -16,6 +16,7 @@
#include "access/amapi.h"
#include "access/itup.h"
+#include "access/itup_attiter.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/xlogreader.h"
@@ -1122,6 +1123,7 @@ typedef struct BTOptions
*/
#define NBTS_TYPE_SINGLE_COLUMN single
#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_DEFAULT default
@@ -1152,12 +1154,19 @@ do { \
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
- IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
) \
: \
( \
- NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ TupleDescAttr(RelationGetDescr(rel), \
+ IndexRelationGetNumberOfKeyAttributes(rel) - 1)->attcacheoff > 0 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_UNCACHED)(__VA_ARGS__) \
+ ) \
) \
)
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index 9733a27bdd..aa8cc51666 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -175,6 +175,40 @@
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but the attcacheoff optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit((itup), (initAttNum), (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/* reset call to SPECIALIZE_CALL for default behaviour */
#undef nbts_call_norel
#define nbts_call_norel(name, rel, ...) \
--
2.30.2
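For reference, using the new iterator directly (outside of the nbts_* macro layer) looks roughly like the sketch below, where itup, tupdesc and natts stand in for a caller's locals:

    IAttrIterStateData  iter;

    /* position the iterator so that the next call returns attribute 1 */
    index_attiterinit(itup, 1, tupdesc, &iter);

    for (AttrNumber attnum = 1; attnum <= natts; attnum++)
    {
        /* walks the tuple left to right, reusing the previous offset */
        Datum   datum = index_attiternext(itup, attnum, tupdesc, &iter);

        if (iter.isNull)
            continue;           /* NULL attribute: datum is 0 */

        /* ... use datum ... */
    }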
v4-0007-Add-specialization-to-btree-index-creation.patchapplication/octet-stream; name=v4-0007-Add-specialization-to-btree-index-creation.patchDownload
From d9e1abfafbdb1c0c9e80bb4cdba9a1a43e33b699 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 21 Apr 2022 16:22:07 +0200
Subject: [PATCH v4 7/8] Add specialization to btree index creation.
This was an oversight that is easily corrected, but an oversight nonetheless.
It increases the (re)build performance of indexes by another few percent.
---
src/backend/utils/sort/tuplesort.c | 147 ++---------------------
src/backend/utils/sort/tuplesort_nbts.h | 148 ++++++++++++++++++++++++
src/include/access/nbtree.h | 18 +++
3 files changed, 175 insertions(+), 138 deletions(-)
create mode 100644 src/backend/utils/sort/tuplesort_nbts.h
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 27a5d53324..44d7d831ff 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -655,8 +655,6 @@ static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
SortTuple *stup);
static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
LogicalTape *tape, unsigned int len);
-static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -679,6 +677,10 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
static void tuplesort_free(Tuplesortstate *state);
static void tuplesort_updatemax(Tuplesortstate *state);
+#define NBT_SPECIALIZE_FILE "../../backend/utils/sort/tuplesort_nbts.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
/*
* Specialized comparators that we can inline into specialized sorts. The goal
* is to try to sort two tuples without having to follow the pointers to the
@@ -1239,7 +1241,7 @@ tuplesort_begin_index_btree(Relation heapRel,
sortopt & TUPLESORT_RANDOMACCESS,
PARALLEL_SORT(state));
- state->comparetup = comparetup_index_btree;
+ state->comparetup = NBT_SPECIALIZE_NAME(comparetup_index_btree, indexRel);
state->copytup = copytup_index;
state->writetup = writetup_index;
state->readtup = readtup_index;
@@ -1357,7 +1359,7 @@ tuplesort_begin_index_gist(Relation heapRel,
state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
- state->comparetup = comparetup_index_btree;
+ state->comparetup = NBT_SPECIALIZE_NAME(comparetup_index_btree, indexRel);
state->copytup = copytup_index;
state->writetup = writetup_index;
state->readtup = readtup_index;
@@ -4320,142 +4322,11 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
* The btree and hash cases require separate comparison functions, but the
* IndexTuple representation is the same so the copy/write/read support
* functions can be shared.
+ *
+ * The nbtree variant can be found in tuplesort_nbts.h, and is included
+ * through the nbtree specialization machinery.
*/
-static int
-comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- /*
- * This is similar to comparetup_heap(), but expects index tuples. There
- * is also special handling for enforcing uniqueness, and special
- * treatment for equal keys at the end.
- */
- SortSupport sortKey = state->sortKeys;
- IndexTuple tuple1;
- IndexTuple tuple2;
- int keysz;
- TupleDesc tupDes;
- bool equal_hasnull = false;
- int nkey;
- int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
-
- /* Compare the leading sort key */
- compare = ApplySortComparator(a->datum1, a->isnull1,
- b->datum1, b->isnull1,
- sortKey);
- if (compare != 0)
- return compare;
-
- /* Compare additional sort keys */
- tuple1 = (IndexTuple) a->tuple;
- tuple2 = (IndexTuple) b->tuple;
- keysz = state->nKeys;
- tupDes = RelationGetDescr(state->indexRel);
-
- if (sortKey->abbrev_converter)
- {
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
- }
-
- /* they are equal, so we only need to examine one null flag */
- if (a->isnull1)
- equal_hasnull = true;
-
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
- {
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
-
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare; /* done when we find unequal attributes */
-
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
- equal_hasnull = true;
- }
-
- /*
- * If btree has asked us to enforce uniqueness, complain if two equal
- * tuples are detected (unless there was at least one NULL field and NULLS
- * NOT DISTINCT was not set).
- *
- * It is sufficient to make the test here, because if two tuples are equal
- * they *must* get compared at some stage of the sort --- otherwise the
- * sort algorithm wouldn't have checked whether one must appear before the
- * other.
- */
- if (state->enforceUnique && !(!state->uniqueNullsNotDistinct && equal_hasnull))
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- /*
- * Some rather brain-dead implementations of qsort (such as the one in
- * QNX 4) will sometimes call the comparison routine to compare a
- * value to itself, but we always use our own implementation, which
- * does not.
- */
- Assert(tuple1 != tuple2);
-
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(state->indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(state->indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(state->heapRel,
- RelationGetRelationName(state->indexRel))));
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is required for
- * btree indexes, since heap TID is treated as an implicit last key
- * attribute in order to ensure that all keys in the index are physically
- * unique.
- */
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
- /* ItemPointer values should never be equal */
- Assert(false);
-
- return 0;
-}
-
static int
comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state)
diff --git a/src/backend/utils/sort/tuplesort_nbts.h b/src/backend/utils/sort/tuplesort_nbts.h
new file mode 100644
index 0000000000..d1b2670747
--- /dev/null
+++ b/src/backend/utils/sort/tuplesort_nbts.h
@@ -0,0 +1,148 @@
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static int NBTS_FUNCTION(comparetup_index_btree)(const SortTuple *a,
+ const SortTuple *b,
+ Tuplesortstate *state);
+
+static int
+NBTS_FUNCTION(comparetup_index_btree)(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ /*
+ * This is similar to comparetup_heap(), but expects index tuples. There
+ * is also special handling for enforcing uniqueness, and special
+ * treatment for equal keys at the end.
+ */
+ SortSupport sortKey = state->sortKeys;
+ IndexTuple tuple1;
+ IndexTuple tuple2;
+ int keysz;
+ TupleDesc tupDes;
+ bool equal_hasnull = false;
+ int nkey;
+ int32 compare;
+ nbts_attiterdeclare(tuple1);
+ nbts_attiterdeclare(tuple2);
+
+ /* Compare the leading sort key */
+ compare = ApplySortComparator(a->datum1, a->isnull1,
+ b->datum1, b->isnull1,
+ sortKey);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ tuple1 = (IndexTuple) a->tuple;
+ tuple2 = (IndexTuple) b->tuple;
+ keysz = state->nKeys;
+ tupDes = RelationGetDescr(state->indexRel);
+
+ if (!sortKey->abbrev_converter)
+ {
+ nkey = 2;
+ sortKey++;
+ }
+ else
+ nkey = 1;
+
+ if (a->isnull1)
+ equal_hasnull = true;
+
+ nbts_attiterinit(tuple1, nkey, tupDes);
+ nbts_attiterinit(tuple2, nkey, tupDes);
+
+ nbts_foreachattr(nkey, keysz)
+ {
+ Datum datum1,
+ datum2;
+ datum1 = nbts_attiter_nextattdatum(tuple1, tupDes);
+ datum2 = nbts_attiter_nextattdatum(tuple2, tupDes);
+
+ if (nbts_attiter_attnum == 1)
+ {
+ compare = ApplySortAbbrevFullComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ }
+ else
+ {
+ compare = ApplySortComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ }
+
+ if (compare != 0)
+ return compare;
+
+ if (nbts_attiter_curattisnull(tuple1))
+ equal_hasnull = true;
+
+ sortKey++;
+ }
+
+ /*
+ * If btree has asked us to enforce uniqueness, complain if two equal
+ * tuples are detected (unless there was at least one NULL field and NULLS
+ * NOT DISTINCT was not set).
+ *
+ * It is sufficient to make the test here, because if two tuples are equal
+ * they *must* get compared at some stage of the sort --- otherwise the
+ * sort algorithm wouldn't have checked whether one must appear before the
+ * other.
+ */
+ if (state->enforceUnique && !(!state->uniqueNullsNotDistinct && equal_hasnull))
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ /*
+ * Some rather brain-dead implementations of qsort (such as the one in
+ * QNX 4) will sometimes call the comparison routine to compare a
+ * value to itself, but we always use our own implementation, which
+ * does not.
+ */
+ Assert(tuple1 != tuple2);
+
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(state->indexRel, values, isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(state->indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(state->heapRel,
+ RelationGetRelationName(state->indexRel))));
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is required for
+ * btree indexes, since heap TID is treated as an implicit last key
+ * attribute in order to ensure that all keys in the index are physically
+ * unique.
+ */
+ {
+ BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
+ BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
+
+ if (blk1 != blk2)
+ return (blk1 < blk2) ? -1 : 1;
+ }
+ {
+ OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
+ OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
+
+ if (pos1 != pos2)
+ return (pos1 < pos2) ? -1 : 1;
+ }
+
+ /* ItemPointer values should never be equal */
+ Assert(false);
+
+ return 0;
+}
+
+#endif
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 92894e4ea7..11116b47ca 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1170,6 +1170,24 @@ do { \
) \
)
+#define NBT_SPECIALIZE_NAME(name, rel) \
+( \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_COLUMN) \
+ ) \
+ : \
+ ( \
+ TupleDescAttr(RelationGetDescr(rel), \
+ IndexRelationGetNumberOfKeyAttributes(rel) - 1)->attcacheoff > 0 ? ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_UNCACHED) \
+ ) \
+ ) \
+)
+
#else /* not defined NBTS_ENABLED */
#define nbt_opt_specialize(rel)
--
2.30.2
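Spelled out without the macro layer, the NBT_SPECIALIZE_NAME / NBT_SPECIALIZE_CALL selection used above reduces to the sketch below for one specialized operation (_bt_compare); the _single/_cached/_uncached suffixes are the hypothetical results of NBTS_MAKE_NAME, and the real code is a macro that works for any specialized function name:

    static inline int32
    bt_compare_dispatch(Relation rel, BTScanInsert key,
                        Page page, OffsetNumber offnum)
    {
        TupleDesc   desc = RelationGetDescr(rel);
        int         nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);

        if (nkeyatts == 1)
            return _bt_compare_single(rel, key, page, offnum);
        else if (TupleDescAttr(desc, nkeyatts - 1)->attcacheoff > 0)
            return _bt_compare_cached(rel, key, page, offnum);
        else
            return _bt_compare_uncached(rel, key, page, offnum);
    }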
v4-0008-Implement-dynamic-prefix-compression-in-nbtree.patchapplication/octet-stream; name=v4-0008-Implement-dynamic-prefix-compression-in-nbtree.patchDownload
From 36019f49e51a82f5a6b33f02486dd59f0c06e509 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 6 Jun 2022 23:16:18 +0200
Subject: [PATCH v4 8/8] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, when some prefix of the
key attributes on both sides of the tuple that is being compared
is known to be equal to the scankey, then the tuple currently
being compared must also share that prefix with the scankey.
We cannot propagate this information to _binsrch on lower pages,
as the downstream page may concurrently have split and/or have
merged with its deleted left neighbour (see [0]), which moves
the keyspace of the linked page. We thus can only trust the
current state of the current page for this optimization, which
means we must validate this state each time we open the page.
Although this limits the overall applicability of the
optimization, it still allows for a nice performance
improvement in most cases where the initial columns have many
duplicate values and a compare function that is not cheap.
Additionally, most of the time a page's highkey is equal to the
right separator on the parent page. By storing this separator
and doing a binary equality check, we can cheaply validate the
highkey of a page, which also allows us to carry over the
right separator's prefix into the page.
---
contrib/amcheck/verify_nbtree.c | 17 +--
src/backend/access/nbtree/README | 25 +++++
src/backend/access/nbtree/nbtinsert.c | 14 ++-
src/backend/access/nbtree/nbtinsert_spec.h | 22 ++--
src/backend/access/nbtree/nbtsearch.c | 3 +-
src/backend/access/nbtree/nbtsearch_spec.h | 115 ++++++++++++++++++---
src/include/access/nbtree_specialized.h | 9 +-
7 files changed, 169 insertions(+), 36 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 2beeebb163..8c4215372a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2700,6 +2700,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2709,13 +2710,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2777,6 +2778,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2787,7 +2789,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2839,10 +2841,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2862,10 +2865,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2900,13 +2904,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3c08888c23..13ac9ee2be 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,31 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because nbtree indexes have a sorted keyspace, once we have determined that
+some prefix of columns of the tuples on both sides of the tuple that is
+being compared is equal to the scankey, the current tuple must also share
+that prefix with the scankey. This allows us to skip comparing those columns,
+potentially saving cycles.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so this is only useful on the page level: Concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by an inner page to the right. If we re-used high- and low-column-prefixes,
+we would not be able to detect a change of keyspace from e.g. (2,2) to (1,2),
+and subsequently return invalid results. This race condition can only be
+prevented by re-establishing the prefix-equal-columns for each page.
+
+The upside is that we already have a comparison result for the highest
+value of a page: a page's highkey is compared to the scankey while we have
+a pin on the page in the _bt_moveright procedure. The _bt_binsrch procedure
+will use this result as the rightmost prefix bound, and each step in the
+binary search (that does not compare less than the insert key) improves the
+equal-prefix bounds.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ec6c73d1cc..20e5f33f98 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -132,7 +132,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ offset = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -142,6 +142,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -177,7 +179,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(nbts_call(_bt_compare, rel, itup_key, page, offset) < 0);
+ Assert(nbts_call(_bt_compare, rel, itup_key, page, offset,
+ &cmpcol) < 0);
break;
}
@@ -202,7 +205,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (nbts_call(_bt_compare, rel, itup_key, page, offset) != 0)
+ if (nbts_call(_bt_compare, rel, itup_key, page, offset,
+ &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -412,11 +416,13 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY);
+ highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY,
+ &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
diff --git a/src/backend/access/nbtree/nbtinsert_spec.h b/src/backend/access/nbtree/nbtinsert_spec.h
index 97c866aea3..ccba0fa5ed 100644
--- a/src/backend/access/nbtree/nbtinsert_spec.h
+++ b/src/backend/access/nbtree/nbtinsert_spec.h
@@ -73,6 +73,7 @@ NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber comparecol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -91,7 +92,8 @@ NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- nbts_call(_bt_compare, rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ nbts_call(_bt_compare, rel, insertstate->itup_key, page,
+ P_HIKEY, &comparecol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -221,6 +223,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
/*
* Does the new tuple belong on this page?
*
@@ -238,7 +241,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0)
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, insertstate, stack);
@@ -291,6 +294,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -321,7 +325,8 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) != 0 ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY,
+ &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -336,10 +341,13 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) || nbts_call(_bt_compare, rel, itup_key,
+ page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -358,7 +366,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d5152bfcb7..607940bbcd 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -178,6 +178,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ AttrNumber attno = 1;
Assert(!BTScanPosIsValid(so->currPos));
@@ -696,7 +697,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = nbts_call(_bt_binsrch, rel, &inskey, buf);
+ offnum = nbts_call(_bt_binsrch, rel, &inskey, buf, &attno);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
index a5c5f2b94f..19a6178334 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.h
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -10,8 +10,10 @@
*/
#ifndef NBTS_SPECIALIZING_DEFAULT
-static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel, BTScanInsert key,
- Buffer buf);
+static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ AttrNumber *highkeycmpcol);
static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
@@ -38,7 +40,8 @@ static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
static OffsetNumber
NBTS_FUNCTION(_bt_binsrch)(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -46,6 +49,8 @@ NBTS_FUNCTION(_bt_binsrch)(Relation rel,
high;
int32 result,
cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -87,17 +92,26 @@ NBTS_FUNCTION(_bt_binsrch)(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = nbts_call(_bt_compare, rel, key, page, mid);
+ result = nbts_call(_bt_compare, rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
+ *highkeycmpcol = highcmpcol;
+
/*
* At this point we have high == low, but be careful: they could point
* past the last slot on the page.
@@ -423,6 +437,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
/* Get the root page to start with */
*bufP = _bt_getroot(rel, access);
@@ -441,6 +456,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
IndexTuple itup;
BlockNumber child;
BTStack new_stack;
+ AttrNumber highkeycmpcol = 1;
/*
* Race -- the page we just grabbed may have split since we read its
@@ -456,7 +472,8 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
*/
*bufP = nbts_call(_bt_moveright, rel, key, *bufP,
(access == BT_WRITE), stack_in,
- page_access, snapshot);
+ page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -468,12 +485,17 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = nbts_call(_bt_binsrch, rel, key, *bufP);
+ offnum = nbts_call(_bt_binsrch, rel, key, *bufP, &highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
child = BTreeTupleGetDownLink(itup);
+ if (highkeycmpcol > 1)
+ {
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+ }
+
/*
* We need to save the location of the pivot tuple we chose in a new
* stack entry for this page/level. If caller ends up splitting a
@@ -507,6 +529,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ AttrNumber highkeycmpcol = 1;
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -517,7 +540,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
* move right to its new sibling. Do that.
*/
*bufP = nbts_call(_bt_moveright, rel, key, *bufP, true, stack_in,
- BT_WRITE, snapshot);
+ BT_WRITE, snapshot, &highkeycmpcol, (char *) tupdatabuf);
}
return stack_in;
@@ -565,12 +588,16 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
bool forupdate,
BTStack stack,
int access,
- Snapshot snapshot)
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
+ Assert(PointerIsValid(comparecol));
+
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -592,12 +619,17 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = cmpcol;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -623,14 +655,49 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY) >= cmpval)
+ /*
+ * When comparecol is > 1, tupdatabuf is filled with the right separator
+ * of the parent node. This allows us to do a binary equality check
+ * between the parent node's right separator (which is < key) and this
+ * page's P_HIKEY. If they are equal, we can reuse the result of the
+ * parent node's rightkey compare, which means we can potentially save
+ * a full key compare.
+ *
+ * Without this, we'd on average use 3 full key compares per page before
+ * we achieve full dynamic prefix bounds; with this optimization we can
+ * skip one of those.
+ *
+ * 3: 1 for the highkey (rightmost), and on average 2 before we move
+ * right in the binary search on the page.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ }
+ }
+
+ if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY,
+ &cmpcol) >= cmpval)
{
/* step right one page */
+ *comparecol = 1;
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -663,7 +730,8 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
* list split).
*/
OffsetNumber
-NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -673,6 +741,7 @@ NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -723,16 +792,21 @@ NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = nbts_call(_bt_compare, rel, key, page, mid);
+ result = nbts_call(_bt_compare, rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
if (result != 0)
stricthigh = high;
}
@@ -813,7 +887,8 @@ int32
NBTS_FUNCTION(_bt_compare)(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -854,10 +929,11 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- nbts_attiterinit(itup, 1, itupdesc);
- nbts_foreachattr(1, ncmpkey)
+ nbts_attiterinit(itup, *comparecol, itupdesc);
+ scankey = key->scankeys + ((*comparecol) - 1);
+
+ nbts_foreachattr(*comparecol, ncmpkey)
{
Datum datum;
@@ -902,11 +978,20 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = nbts_attiter_attnum;
return result;
+ }
scankey++;
}
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
diff --git a/src/include/access/nbtree_specialized.h b/src/include/access/nbtree_specialized.h
index c45fa84aed..ddceb4a4aa 100644
--- a/src/include/access/nbtree_specialized.h
+++ b/src/include/access/nbtree_specialized.h
@@ -43,12 +43,15 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key,
extern Buffer
NBTS_FUNCTION(_bt_moveright)(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access,
- Snapshot snapshot);
+ Snapshot snapshot, AttrNumber *comparecol,
+ char *tupdatabuf);
extern OffsetNumber
-NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate);
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
extern int32
NBTS_FUNCTION(_bt_compare)(Relation rel, BTScanInsert key,
- Page page, OffsetNumber offnum);
+ Page page, OffsetNumber offnum,
+ AttrNumber *comparecol);
/*
* prototypes for functions in nbtutils_spec.h
--
2.30.2
On Mon, 4 Jul 2022 at 16:18, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Sun, 5 Jun 2022 at 21:12, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > While working on benchmarking the v2 patchset, I noticed no
> > improvement on reindex, which I attributed to forgetting to also
> > specialize comparetup_index_btree in tuplesort.c. After adding the
> > specialization there as well (attached in v3), reindex performance
> > improved significantly too.
>
> PFA version 4 of this patchset. Changes:

Version 5 is attached; it is identical to v4 except for bitrot fixes to
deal with commit f58d7073.
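
To make the dispatch easier to follow without reading the full patches: the
hot nbtree functions are compiled once per key shape, and call sites select
the right variant through a macro that branches on the shape of the index
key. A simplified sketch of the relevant macros from the attached 0001/0004
patches (abbreviated here; the real definitions live in nbtree.h and
nbtree_specialize.h):

    /*
     * Dispatch on key shape: use the single-column variant when the index
     * has exactly one key column, otherwise fall back to the variant that
     * relies on attcacheoff-based offset caching.
     */
    #define NBT_SPECIALIZE_CALL(function, rel, ...) \
        ( \
            IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? \
                NBTS_MAKE_NAME(function, single)(__VA_ARGS__) : \
                NBTS_MAKE_NAME(function, cached)(__VA_ARGS__) \
        )

    /* a call site then looks roughly like this: */
    offnum = nbts_call(_bt_binsrch, rel, key, buf, &highkeycmpcol);

Inside the specialized files the nbts_call* macros resolve directly to the
specialized function, so the shape check above is (roughly) only paid at the
entry points into the specialized code.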
Kind regards,
Matthias van de Meent.
Attachments:
v5-0002-Use-specialized-attribute-iterators-in-backend-nb.patch (application/x-patch)
From 7f13603d7a37dae59ca5d08346d79e67e5d3fe4e Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:30:00 +0200
Subject: [PATCH v5 2/8] Use specialized attribute iterators in
backend/*/nbt*_spec.h
Split out to make it clear what substantial changes were made to the
pre-existing functions.
Even though not all nbt*_spec functions have been updated, most call sites
can now call the specialized functions directly instead of having to
determine the right specialization based on the (potentially locally
unavailable) index relation, making the specialization of those functions
worth the effort.
---
src/backend/access/nbtree/nbtsearch_spec.h | 16 +++---
src/backend/access/nbtree/nbtsort_spec.h | 24 +++++----
src/backend/access/nbtree/nbtutils_spec.h | 63 +++++++++++++---------
3 files changed, 62 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
index 73d5370496..a5c5f2b94f 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.h
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -823,6 +823,7 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -854,23 +855,26 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
index 8f4a3602ca..d3f2db2dc4 100644
--- a/src/backend/access/nbtree/nbtsort_spec.h
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -27,8 +27,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -50,7 +49,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -82,22 +81,25 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
}
else if (itup != NULL)
{
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
int32 compare = 0;
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
index a4b934ae7a..638eff18f6 100644
--- a/src/backend/access/nbtree/nbtutils_spec.h
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -211,6 +211,8 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
@@ -222,20 +224,22 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -243,6 +247,7 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -295,7 +300,7 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
- int i;
+ nbts_attiterdeclare(itup);
itupdesc = RelationGetDescr(rel);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -326,7 +331,10 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -337,27 +345,30 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -744,24 +755,28 @@ NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) !=
+ nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
--
2.30.2
v5-0004-Optimize-attribute-iterator-access-for-single-col.patch (application/x-patch)
From d33e725c6e0dfd45a2b6e1461fcbdb744249f9ae Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:47:50 +0200
Subject: [PATCH v5 4/8] Optimize attribute iterator access for single-column
btree keys
This removes the index_getattr_nocache call path, which has significant overhead.
---
src/include/access/nbtree.h | 9 +++-
src/include/access/nbtree_specialize.h | 63 ++++++++++++++++++++++++++
2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 489b623663..1559399b0e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1120,6 +1120,7 @@ typedef struct BTOptions
/*
* Macros used in the nbtree specialization code.
*/
+#define NBTS_TYPE_SINGLE_COLUMN single
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
@@ -1151,7 +1152,13 @@ do { \
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
- NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
)
#else /* not defined NBTS_ENABLED */
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index 23fdda4f0e..9733a27bdd 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -79,6 +79,69 @@
#define nbts_call_norel(name, rel, ...) \
(NBTS_FUNCTION(name)(__VA_ARGS__))
+/*
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path may never be used for indexes with multiple key
+ * columns, because it does not ever continue to a next column.
+ */
+
+#define NBTS_SPECIALIZING_SINGLE_COLUMN
+#define NBTS_TYPE NBTS_TYPE_SINGLE_COLUMN
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+/*
+ * We cast endAttNum to void to prevent unused-variable warnings.
+ * The if statement and for loop are structured like this so that the
+ * compiler can see there is only a single iteration and unroll the loop.
+ * We need `break` to work inside the loop body, so a plain 'if'
+ * statement would not suffice.
+ */
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert((endAttNum) == 1); ((void) (endAttNum)); \
+ if ((initAttNum) == 1) for (int spec_i = 0; spec_i < 1; spec_i++)
+
+#define nbts_attiter_attnum 1
+
+/*
+ * Simplified (optimized) variant of index_getattr specialized for extracting
+ * only the first attribute: cache offset is guaranteed to be 0, and as such
+ * no cache is required.
+ */
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+( \
+ AssertMacro(spec_i == 0), \
+ (IndexTupleHasNulls(itup) && att_isnull(0, (char *)(itup) + sizeof(IndexTupleData))) ? \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull)) = true, \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull) = false), \
+ (Datum) fetchatt(TupleDescAttr((tupDesc), 0), \
+ (char *) (itup) + IndexInfoFindDataOffset((itup)->t_info)) \
+ ) \
+)
+
+#define nbts_attiter_curattisnull(tuple) \
+ NBTS_MAKE_NAME(tuple, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_COLUMN
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* Multiple key columns, optimized access for attcacheoff -cacheable offsets.
*/
--
2.30.2
v5-0003-Specialize-the-nbtree-rd_indam-entry.patch (application/x-patch)
From b3b10f8b583fb3bf05d7055c2b1319128e596c63 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:54:52 +0200
Subject: [PATCH v5 3/8] Specialize the nbtree rd_indam entry.
Because each index's rd_indam struct is allocated separately, we can freely
modify it at runtime without impacting other indexes of the same access
method. For btinsert (which effectively only calls _bt_doinsert) it is
useful to specialize that function, which also makes rd_indam->aminsert a
good signal for whether the indexRelation has been fully optimized yet.
---
src/backend/access/nbtree/nbtree.c | 7 +++++++
src/backend/access/nbtree/nbtsearch.c | 2 ++
src/backend/access/nbtree/nbtsort.c | 2 ++
src/include/access/nbtree.h | 14 ++++++++++++++
4 files changed, 25 insertions(+)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1481db4dcf..2ce996a0f5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -160,6 +160,8 @@ btbuildempty(Relation index)
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+ nbt_opt_specialize(index);
+
/*
* Write the page and log it. It might seem that an immediate sync would
* be sufficient to guarantee that the file exists on disk, but recovery
@@ -322,6 +324,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -764,6 +768,7 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(info->index);
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
@@ -797,6 +802,8 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (info->analyze_only)
return stats;
+ nbt_opt_specialize(info->index);
+
/*
* If btbulkdelete was called, we need not do anything (we just maintain
* the information used within _bt_vacuum_needs_cleanup() by calling
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e81eee9c35..d5152bfcb7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -181,6 +181,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(!BTScanPosIsValid(so->currPos));
+ nbt_opt_specialize(scan->indexRelation);
+
pgstat_count_index_scan(rel);
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 3558b2d3da..521a2a33c5 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -305,6 +305,8 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
BTBuildState buildstate;
double reltuples;
+ nbt_opt_specialize(index);
+
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
ResetUsage();
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0dbab16..489b623663 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1132,6 +1132,19 @@ typedef struct BTOptions
#ifdef NBTS_ENABLED
+/*
+ * Replace the functions in the rd_indam struct with variants optimized for
+ * our key shape, if that hasn't been done yet.
+ *
+ * This only needs to happen once per loaded index relation, so the check is
+ * unlikely to trigger and is therefore wrapped in unlikely().
+ */
+#define nbt_opt_specialize(rel) \
+do { \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert)) \
+ _bt_specialize(rel); \
+} while (false)
+
/*
* Access a specialized nbtree function, based on the shape of the index key.
*/
@@ -1143,6 +1156,7 @@ typedef struct BTOptions
#else /* not defined NBTS_ENABLED */
+#define nbt_opt_specialize(rel)
#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
#endif /* NBTS_ENABLED */
--
2.30.2
v5-0001-Specialize-nbtree-functions-on-btree-key-shape.patch (application/x-patch)
From 09f5afd1c2b6784a462eb40910b6a71ba3480dd8 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 30 Jan 2022 16:23:31 +0100
Subject: [PATCH v5 1/8] Specialize nbtree functions on btree key shape
nbtree keys do not all have the same shape, so a significant amount of time
is spent in code that exists only to deal with other key shapes. By
specializing function calls based on the key shape, we can remove or reduce
these causes of overhead.
This commit adds the basic infrastructure for specializing specific hot code
in the nbtree AM to certain shapes of keys, initially splitting out (but not
yet specializing) the attcacheoff-capable case.
Note that we generate N specialized functions and 1 'default' function for
each specializable function.
This feature can be disabled by removing the '#define NBTS_ENABLED' line in
nbtree.h.
---
src/backend/access/nbtree/README | 22 +
src/backend/access/nbtree/nbtdedup.c | 300 +------
src/backend/access/nbtree/nbtdedup_spec.h | 313 +++++++
src/backend/access/nbtree/nbtinsert.c | 572 +-----------
src/backend/access/nbtree/nbtinsert_spec.h | 569 ++++++++++++
src/backend/access/nbtree/nbtpage.c | 4 +-
src/backend/access/nbtree/nbtree.c | 31 +-
src/backend/access/nbtree/nbtree_spec.h | 50 ++
src/backend/access/nbtree/nbtsearch.c | 994 +--------------------
src/backend/access/nbtree/nbtsearch_spec.h | 994 +++++++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 271 +-----
src/backend/access/nbtree/nbtsort_spec.h | 275 ++++++
src/backend/access/nbtree/nbtsplitloc.c | 14 +-
src/backend/access/nbtree/nbtutils.c | 755 +---------------
src/backend/access/nbtree/nbtutils_spec.h | 772 ++++++++++++++++
src/backend/utils/sort/tuplesort.c | 4 +-
src/include/access/nbtree.h | 61 +-
src/include/access/nbtree_specialize.h | 204 +++++
src/include/access/nbtree_specialized.h | 67 ++
19 files changed, 3357 insertions(+), 2915 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.h
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.h
create mode 100644 src/backend/access/nbtree/nbtree_spec.h
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.h
create mode 100644 src/backend/access/nbtree/nbtsort_spec.h
create mode 100644 src/backend/access/nbtree/nbtutils_spec.h
create mode 100644 src/include/access/nbtree_specialize.h
create mode 100644 src/include/access/nbtree_specialized.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 5529afc1fe..3c08888c23 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1041,6 +1041,28 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+
+Notes about nbtree call specialization
+--------------------------------------
+
+Attribute iteration is a significant overhead for multi-column indexes.
+We can avoid it by specializing performance-sensitive search functions
+and calling those selectively. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths. This performance benefit is at the cost
+of binary size, so this feature can be disabled by defining NBTS_DISABLED.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - single-column indexes
+ NB: The code paths of this optimization do not support multiple key columns.
+ - multi-column indexes that could benefit from the attcacheoff optimization
+ NB: This is also used for the default case, and is slow for uncacheable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 0207421a5d..d7025d8e1c 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,259 +22,16 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple, and actually update the page. Else
- * reset the state and move on without modifying the page.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
/*
* Perform bottom-up index deletion pass.
@@ -373,7 +130,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
- else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ else if (nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
@@ -748,55 +505,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
diff --git a/src/backend/access/nbtree/nbtdedup_spec.h b/src/backend/access/nbtree/nbtdedup_spec.h
new file mode 100644
index 0000000000..27e5a7e686
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.h
@@ -0,0 +1,313 @@
+/*
+ * Specialized functions included in nbtdedup.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static bool NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
+
+#endif /* ifndef NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = nbts_call(_bt_do_singleval, rel, page, state,
+ minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and actually update the page. Else
+ * reset the state and move on without modifying the page.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f6f4af8bfe..ec6c73d1cc 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,18 +30,13 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
-static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_stepright(Relation rel,
+ BTInsertState insertstate,
+ BTStack stack);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -73,311 +68,10 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
- NULL);
-}
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -438,7 +132,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = nbts_call(_bt_binsrch_insert, rel, insertstate);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -483,7 +177,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(nbts_call(_bt_compare, rel, itup_key, page, offset) < 0);
break;
}
@@ -508,7 +202,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (nbts_call(_bt_compare, rel, itup_key, page, offset) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -722,7 +416,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -769,246 +463,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
-
- newitemoff = _bt_binsrch_insert(rel, insertstate);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1649,7 +1103,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lastleft = nposting;
}
- lefthighkey = _bt_truncate(rel, lastleft, firstright, itup_key);
+ lefthighkey = nbts_call(_bt_truncate, rel, lastleft, firstright, itup_key);
itemsz = IndexTupleSize(lefthighkey);
}
else
@@ -2764,8 +2218,8 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
/* Perform deduplication pass (when enabled and index-is-allequalimage) */
if (BTGetDeduplicateItems(rel) && itup_key->allequalimage)
- _bt_dedup_pass(rel, buffer, heapRel, insertstate->itup,
- insertstate->itemsz, (indexUnchanged || uniquedup));
+ nbts_call(_bt_dedup_pass, rel, buffer, heapRel, insertstate->itup,
+ insertstate->itemsz, (indexUnchanged || uniquedup));
}
/*
diff --git a/src/backend/access/nbtree/nbtinsert_spec.h b/src/backend/access/nbtree/nbtinsert_spec.h
new file mode 100644
index 0000000000..97c866aea3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.h
@@ -0,0 +1,569 @@
+/*
+ * Specialized functions for nbtinsert.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted variants would
+ * be unused and would trigger compiler warnings.  Avoid that dead code (and
+ * the warnings) by not emitting these functions when generating the default
+ * specialization.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static BTStack NBTS_FUNCTION(_bt_search_insert)(Relation rel,
+ BTInsertState insertstate);
+
+static OffsetNumber NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ nbts_call(_bt_compare, rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return nbts_call(_bt_search, rel, insertstate->itup_key,
+ &insertstate->buf, BT_WRITE, NULL);
+}
+
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0)
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ Assert(P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0);
+
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
+
+#endif /* ifndef NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = nbts_call(_bt_mkscankey, rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = nbts_call(_bt_search_insert, rel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = nbts_call(_bt_findinsertloc, rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+ itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
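
To illustrate the pattern behind nbtinsert_spec.h: a single template body is
emitted once per statically known key shape via a name-mangling macro, and a
thin dispatcher picks the right variant at run time. The real macros
(NBT_SPECIALIZE_FILE, NBTS_FUNCTION, nbts_call) are defined in
access/nbtree_specialize.h elsewhere in this series and are not reproduced
here; the sketch below is only a self-contained approximation with invented
names (KeyDesc, MAKE_SUM_KEYS, sum_keys), not the patch's actual expansion.

#include <stdbool.h>
#include <stdio.h>

typedef struct KeyDesc
{
	int			natts;			/* number of key attributes */
	bool		single_key;		/* exactly one key column? */
} KeyDesc;

/*
 * "Template" body, emitted once per key shape.  The suffix mangles the
 * function name; natts_expr lets the single-column variant use a constant
 * the compiler can fold, roughly how the specialized variants avoid the
 * generic attribute-iteration work.
 */
#define MAKE_SUM_KEYS(suffix, natts_expr) \
static int \
sum_keys_##suffix(const KeyDesc *key, const int *attrs) \
{ \
	int			sum = 0; \
	(void) key;					/* may go unused in specialized variants */ \
	for (int i = 0; i < (natts_expr); i++) \
		sum += attrs[i]; \
	return sum; \
}

MAKE_SUM_KEYS(single, 1)
MAKE_SUM_KEYS(multi, key->natts)

/* Runtime dispatcher, analogous in spirit to nbts_call() */
static int
sum_keys(const KeyDesc *key, const int *attrs)
{
	if (key->single_key)
		return sum_keys_single(key, attrs);
	return sum_keys_multi(key, attrs);
}

int
main(void)
{
	int			attrs[3] = {4, 5, 6};
	KeyDesc		one = {1, true};
	KeyDesc		three = {3, false};

	printf("%d %d\n", sum_keys(&one, attrs), sum_keys(&three, attrs));
	return 0;
}

The patch presumably does the same thing at file granularity: the
NBT_SPECIALIZE_FILE / #include "access/nbtree_specialize.h" dance suggests that
the *_spec.h file is included once per supported key shape, so each emitted
copy of _bt_search_insert(), _bt_findinsertloc() etc. is compiled against that
shape's attribute accessors.
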
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 8b96708b3e..70304c793d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1967,10 +1967,10 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, targetkey);
+ itup_key = nbts_call(_bt_mkscankey, rel, targetkey);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
- stack = _bt_search(rel, itup_key, &sleafbuf, BT_READ, NULL);
+ stack = nbts_call(_bt_search, rel, itup_key, &sleafbuf, BT_READ, NULL);
/* won't need a second lock or pin on leafbuf */
_bt_relbuf(rel, sleafbuf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b52eca8f38..1481db4dcf 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,10 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -177,33 +181,6 @@ btbuildempty(Relation index)
smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
diff --git a/src/backend/access/nbtree/nbtree_spec.h b/src/backend/access/nbtree/nbtree_spec.h
new file mode 100644
index 0000000000..4c342287f6
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.h
@@ -0,0 +1,50 @@
+/*
+ * Specialized functions for nbtree.c
+ */
+
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
+void
+NBTS_FUNCTION(_bt_specialize)(Relation rel)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbts_call_norel(_bt_specialize, rel, rel);
+#else
+ rel->rd_indam->aminsert = NBTS_FUNCTION(btinsert);
+#endif
+}
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbts_call_norel(_bt_specialize, rel, rel);
+
+ return nbts_call(btinsert, rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexUnchanged, indexInfo);
+#else
+ bool result;
+ IndexTuple itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = nbts_call(_bt_doinsert, rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+#endif
+}
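
As shown above, the default btinsert() first calls _bt_specialize(), which
installs the shape-specific btinsert variant into rel->rd_indam->aminsert, and
then re-dispatches through nbts_call(); later inserts go straight to the
specialized function. Purely as an illustration of that self-replacing
dispatch-slot pattern (all names below are invented, this is not the patch's
code):

#include <stdio.h>

typedef struct IndexAmStub IndexAmStub;
typedef int (*insert_fn) (IndexAmStub *am, int key);

struct IndexAmStub
{
	insert_fn	aminsert;		/* stand-in for rd_indam->aminsert */
};

/* Shape-specific variant, normally emitted by the specialization machinery */
static int
insert_single_key(IndexAmStub *am, int key)
{
	(void) am;
	printf("single-key insert of %d\n", key);
	return 1;
}

/*
 * Generic entry point: install the shape-specific variant into the dispatch
 * slot (the moral equivalent of _bt_specialize()), then forward this call.
 */
static int
insert_default(IndexAmStub *am, int key)
{
	am->aminsert = insert_single_key;
	return am->aminsert(am, key);
}

int
main(void)
{
	IndexAmStub am = {insert_default};

	am.aminsert(&am, 42);		/* first call: specializes, then inserts */
	am.aminsert(&am, 43);		/* later calls skip the generic path */
	return 0;
}
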
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c74543bfde..e81eee9c35 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,11 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -46,6 +43,9 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* _bt_drop_lock_and_maybe_pin()
@@ -70,493 +70,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- */
-BTStack
-_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
- Snapshot snapshot)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- */
-Buffer
-_bt_moveright(Relation rel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- Snapshot snapshot)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- page = BufferGetPage(buf);
- TestForOldSnapshot(snapshot, rel, page);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- break;
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
- {
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- break;
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- high = mid;
- }
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- {
- high = mid;
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
/*----------
* _bt_binsrch_posting() -- posting list binary search.
@@ -625,217 +138,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- return result;
-
- scankey++;
- }
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
-
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -1363,7 +665,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
- stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
+ stack = nbts_call(_bt_search, rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
@@ -1392,7 +694,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = nbts_call(_bt_binsrch, rel, &inskey, buf);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1422,9 +724,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, offnum))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, offnum))
{
- /*
+ /*
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
@@ -1498,280 +800,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2014,7 +1042,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation,
+ scan, dir, P_FIRSTDATAKEY(opaque)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2116,7 +1145,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation, scan,
+ dir, PageGetMaxOffsetNumber(page)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2448,7 +1478,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, start))
{
/*
* There's no actually-matching data on this page. Try to advance to
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
new file mode 100644
index 0000000000..73d5370496
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -0,0 +1,994 @@
+/*
+ * Specialized functions for nbtsearch.c
+ */
+
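+/*
+ * The NBTS_FUNCTION(), nbts_call() and nbts_call_norel() macros used
+ * throughout this file come from access/nbtree_specialize.h, which is not
+ * shown in this hunk.  Very roughly -- the shape suffixes and the dispatch
+ * condition below are simplified, hypothetical illustrations only, and the
+ * real patchset supports more shapes than just "single key column" -- they
+ * expand along these lines:
+ *
+ *     // paste a per-key-shape suffix onto the function name
+ *     #define NBTS_CONCAT_(a, b)   a##_##b
+ *     #define NBTS_CONCAT(a, b)    NBTS_CONCAT_(a, b)
+ *     #define NBTS_FUNCTION(name)  NBTS_CONCAT(name, NBTS_SPECIALIZE_NAME)
+ *
+ *     // dispatch on the index's key shape; callee does not take the relation
+ *     #define nbts_call_norel(name, rel, ...) \
+ *         (IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? \
+ *          NBTS_CONCAT(name, single)(__VA_ARGS__) : \
+ *          NBTS_CONCAT(name, default)(__VA_ARGS__))
+ *
+ *     // same, but also pass the relation as the callee's first argument
+ *     #define nbts_call(name, rel, ...) \
+ *         nbts_call_norel(name, (rel), (rel), __VA_ARGS__)
+ */
+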
+/*
+ * These functions are static and not otherwise exposed, so their "default"
+ * emitted form would go unused and only trigger compiler warnings.  Avoid
+ * generating that unused code (and the warnings) by not emitting these
+ * functions at all when the default variant is being generated.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel, BTScanInsert key,
+ Buffer buf);
+static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_binsrch)(Relation rel,
+ BTScanInsert key,
+ Buffer buf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
+
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* allow next page be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = nbts_call(_bt_checkkeys, scan->indexRelation,
+ scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+#endif /* NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ */
+BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
+ int access, Snapshot snapshot)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP,
+ (access == BT_WRITE), stack_in,
+ page_access, snapshot);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = nbts_call(_bt_binsrch, rel, key, *bufP);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is lowest non-leaf page level prior to leaves. So, if
+ * we're on the level 1 and asked to lock leaf page in write mode,
+ * then lock next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP, true, stack_in,
+ BT_WRITE, snapshot);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split
+ * is completed. 'stack' is only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ */
+Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ Snapshot snapshot)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ break;
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY) >= cmpval)
+ {
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ break;
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
+
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ {
+ high = mid;
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
+
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+NBTS_FUNCTION(_bt_compare)(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+ scankey = key->scankeys;
+ for (int i = 1; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ return result;
+
+ scankey++;
+ }
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index bd1685c441..3558b2d3da 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,9 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* btbuild() -- build a new btree index.
@@ -566,7 +567,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+ wstate.inskey = nbts_call(_bt_mkscankey, wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
@@ -578,7 +579,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
PROGRESS_BTREE_PHASE_LEAF_LOAD);
- _bt_load(&wstate, btspool, btspool2);
+ nbts_call_norel(_bt_load, wstate.index, &wstate, btspool, btspool2);
}
/*
@@ -978,8 +979,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
lastleft = (IndexTuple) PageGetItem(opage, ii);
Assert(IndexTupleSize(oitup) > last_truncextra);
- truncated = _bt_truncate(wstate->index, lastleft, oitup,
- wstate->inskey);
+ truncated = nbts_call(_bt_truncate, wstate->index, lastleft, oitup,
+ wstate->inskey);
if (!PageIndexTupleOverwrite(opage, P_HIKEY, (Item) truncated,
IndexTupleSize(truncated)))
elog(ERROR, "failed to add high key to the index page");
@@ -1176,264 +1177,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
new file mode 100644
index 0000000000..8f4a3602ca
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -0,0 +1,275 @@
+/*
+ * Specialized functions included in nbtsort.c
+ */
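+
+/*
+ * Note on the inclusion mechanism (explanatory; the details are assumed to
+ * live in access/nbtree_specialize.h): nbtsort.c defines
+ * NBT_SPECIALIZE_FILE to point at this file and then includes
+ * nbtree_specialize.h, which is expected to pull this file in once per
+ * supported key shape, with NBTS_FUNCTION() mapping each name (such as
+ * _bt_load) to that shape's variant.  Callers like _bt_leafbuild() then
+ * reach the right variant through the nbts_call_norel() dispatch macro.
+ */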
+
+/*
+ * These functions are not exposed outside the file they are included into,
+ * so their "default" emitted form would go unused and trigger
+ * unused-function compiler warnings.  Avoid both the dead code and the
+ * warnings by not emitting these functions when generating the default
+ * (unspecialized) variant.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static void NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2);
+
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (nbts_call(_bt_keep_natts_fast, wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
+
+#endif
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 241e26d338..8e5337cad7 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -692,7 +692,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
{
itemid = PageGetItemId(state->origpage, maxoff);
tup = (IndexTuple) PageGetItem(state->origpage, itemid);
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -723,7 +723,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -972,7 +972,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* avoid appending a heap TID in new high key, we're done. Finish split
* with default strategy and initial split interval.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
return perfectpenalty;
@@ -993,7 +993,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* If page is entirely full of duplicates, a single value strategy split
* will be performed.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
{
*strategy = SPLIT_MANY_DUPLICATES;
@@ -1031,8 +1031,8 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
itemid = PageGetItemId(state->origpage, P_HIKEY);
hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
- perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
- state->newitem);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, hikey,
+ state->newitem);
if (perfectpenalty <= indnkeyatts)
*strategy = SPLIT_SINGLE_VALUE;
else
@@ -1154,7 +1154,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
lastleft = _bt_split_lastleft(state, split);
firstright = _bt_split_firstright(state, split);
- return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+ return nbts_call(_bt_keep_natts_fast, state->rel, lastleft, firstright);
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index ff260c393a..bc443ebd27 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,11 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1221,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2173,286 +1704,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
new file mode 100644
index 0000000000..a4b934ae7a
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -0,0 +1,772 @@
+/*
+ * Specialized functions included in nbtutils.c
+ */
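+
+/*
+ * Note on the dispatch macros (explanatory; how they expand is an
+ * assumption, the definitions live in access/nbtree_specialize.h):
+ * nbts_call(func, rel, ...) appears to pick the specialization of func
+ * matching rel's key shape and to pass rel through as the first argument,
+ * e.g.
+ *
+ *     keepnatts = nbts_call(_bt_keep_natts_fast, rel, lastleft, firstright);
+ *
+ * while nbts_call_norel(func, rel, ...) presumably uses rel only to select
+ * the variant without forwarding it, as in the _bt_check_rowcompare() call
+ * made from _bt_checkkeys() below.
+ */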
+
+/*
+ * These functions are not exposed outside the file they are included into,
+ * so their "default" emitted form would go unused and trigger
+ * unused-function compiler warnings.  Avoid both the dead code and the
+ * warnings by not emitting these functions when generating the default
+ * (unspecialized) variant.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static bool NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+
+static int NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey, IndexTuple tuple,
+ int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == nbts_call(_bt_keep_natts_fast, rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+#endif /* NBTS_SPECIALIZING_DEFAULT */
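+
+/*
+ * The functions below (starting with _bt_mkscankey) are emitted for the
+ * default key shape as well: unlike the static helpers above, they are
+ * reached through the nbts_call* dispatch macros from other files, so a
+ * default variant must exist.  (Explanatory note; the exact set of
+ * variants that gets emitted is determined by access/nbtree_specialize.h.)
+ */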
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (nbts_call_norel(_bt_check_rowcompare, rel, key, tuple,
+ tupnatts, tupdesc, dir, continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = nbts_call(_bt_keep_natts, rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
+ IndexTuple lastleft,
+ IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 421afcf47d..7f398bd4eb 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1153,7 +1153,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = nbts_call(_bt_mkscankey, indexRel, NULL);
if (state->indexInfo->ii_Expressions != NULL)
{
@@ -1251,7 +1251,7 @@ tuplesort_begin_index_btree(Relation heapRel,
state->enforceUnique = enforceUnique;
state->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = nbts_call(_bt_mkscankey, indexRel, NULL);
/* Prepare SortSupport data for each column */
state->sortKeys = (SortSupport) palloc0(state->nKeys *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 93f8267b48..83e0dbab16 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1116,15 +1116,47 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+
+
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
+
+#define NBTS_ENABLED
+
+#ifdef NBTS_ENABLED
+
+/*
+ * Access a specialized nbtree function, based on the shape of the index key.
+ */
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) \
+( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+)
+
+#else /* not defined NBTS_ENABLED */
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
+
+#endif /* NBTS_ENABLED */
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specialized.h"
+#include "nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1155,9 +1187,6 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
- IndexTuple newitem, Size newitemsz,
- bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1173,9 +1202,6 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
@@ -1223,12 +1249,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
- int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -1237,7 +1257,6 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1245,8 +1264,6 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1259,10 +1276,6 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
new file mode 100644
index 0000000000..23fdda4f0e
--- /dev/null
+++ b/src/include/access/nbtree_specialize.h
@@ -0,0 +1,204 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_specialize.h
+ * header file for postgres btree access method implementation.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_specialize.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_call(function, indexrel, ...rest_of_args), and
+ * nbts_call_norel(function, indexrel, ...args)
+ * This will call the specialized variant of 'function' based on the index
+ * relation data.
+ * The difference between nbts_call and nbts_call_norel is that _call
+ * uses indexrel as first argument in the function call, whereas
+ * nbts_call_norel does not.
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ *
+ * example usage:
+ *
+ * kwithnulls = nbts_call_norel(_bt_key_hasnulls, myindex, mytuple, tupDesc);
+ *
+ * NBTS_FUNCTION(_bt_key_hasnulls)(IndexTuple mytuple, TupleDesc tupDesc)
+ * {
+ * nbts_attiterdeclare(mytuple);
+ * nbts_attiterinit(mytuple, 1, tupDesc);
+ * nbts_foreachattr(1, 10)
+ * {
+ * Datum it = nbts_attiter_nextattdatum(mytuple, tupDesc);
+ * if (nbts_attiter_curattisnull(mytuple))
+ * return true;
+ * }
+ * return false;
+ * }
+ */
+
+/*
+ * Call a potentially specialized function for a given btree operation.
+ *
+ * NB: the rel argument is evaluated multiple times.
+ */
+#define nbts_call(name, rel, ...) \
+ nbts_call_norel(name, (rel), (rel), __VA_ARGS__)
+
+#ifdef NBTS_ENABLED
+
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+#ifdef nbts_call_norel
+#undef nbts_call_norel
+#endif
+
+#define nbts_call_norel(name, rel, ...) \
+ (NBTS_FUNCTION(name)(__VA_ARGS__))
+
+/*
+ * Multiple key columns, with optimized access for attcacheoff-cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_CACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* reset call to SPECIALIZE_CALL for default behaviour */
+#undef nbts_call_norel
+#define nbts_call_norel(name, rel, ...) \
+ NBT_SPECIALIZE_CALL(name, (rel), __VA_ARGS__)
+
+/*
+ * "Default", externally accessible, not so much optimized functions
+ */
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+/* for the default functions, we want to use the unspecialized name. */
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) name
+
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* from here on there are no more NBTS_FUNCTIONs */
+#undef NBTS_FUNCTION
+
+#else /* not defined NBTS_ENABLED */
+
+/*
+ * NBTS_ENABLED is not defined, so we don't want to use the specializations.
+ * We revert to the behaviour from PG14 and earlier, which only uses
+ * attcacheoff.
+ */
+
+#define NBTS_FUNCTION(name) name
+
+#define nbts_call_norel(name, rel, ...) \
+ name(__VA_ARGS__)
+
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+
+#endif /* !NBTS_ENABLED */
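To make the macro layering easier to follow: with NBTS_ENABLED defined, the placeholder NBT_SPECIALIZE_CALL above currently always selects the "cached" variant (the rel argument is there so that a later patch can dispatch on the index's key shape instead), so a call site such as the one in tuplesort.c expands roughly as follows:

    /* as written at the call site */
    indexScanKey = nbts_call(_bt_mkscankey, indexRel, NULL);

    /* after nbts_call -> nbts_call_norel -> NBT_SPECIALIZE_CALL */
    indexScanKey = _bt_mkscankey_cached(indexRel, NULL);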
diff --git a/src/include/access/nbtree_specialized.h b/src/include/access/nbtree_specialized.h
new file mode 100644
index 0000000000..c45fa84aed
--- /dev/null
+++ b/src/include/access/nbtree_specialized.h
@@ -0,0 +1,67 @@
+/*
+ * prototypes for functions that are included in nbtree.h
+ */
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_specialize)(Relation rel);
+
+extern bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key,
+ Buffer *bufP, int access,
+ Snapshot snapshot);
+extern Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel, BTScanInsert key, Buffer buf,
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot);
+extern OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate);
+extern int32
+NBTS_FUNCTION(_bt_compare)(Relation rel, BTScanInsert key,
+ Page page, OffsetNumber offnum);
+
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup);
+extern bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
--
2.30.2
v5-0005-Add-a-function-whose-task-it-is-to-populate-all-a.patch (application/x-patch)
From 303a4dd46705aa9aef49d0cc92ed66a49c16cd79 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:51:01 +0200
Subject: [PATCH v5 5/8] Add a function whose task it is to populate all
attcacheoff-s of a TupleDesc's attributes
It fills uncacheable offsets with -2, as opposed to -1, which signals
"unknown". This allows users of the API to determine the cacheability
of an attribute in O(1) after this one-time O(n) pass, rather than the
repeated O(n) cost that currently applies.
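To illustrate the intended calling pattern, here is a minimal sketch (not part of the patch; the helper is hypothetical) of how a consumer can test cacheability in O(1) once the pass has run:

    #include "postgres.h"
    #include "access/attnum.h"
    #include "access/tupdesc.h"

    /*
     * Sketch: after the one-time O(n) pass, attcacheoff is either a valid
     * offset (>= 0) or -2 ("never cacheable"), so the test is a single
     * comparison.
     */
    static bool
    attoffset_is_cached(TupleDesc tupdesc, AttrNumber attnum)
    {
        PopulateTupleDescCacheOffsets(tupdesc);

        return TupleDescAttr(tupdesc, attnum - 1)->attcacheoff >= 0;
    }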
---
src/backend/access/common/tupdesc.c | 97 +++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 99 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index d6fb261e20..af4ef00f2b 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -919,3 +919,100 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc, returning the last
+ * attcacheoff with a valid value.
+ *
+ * Sets attcacheoff to -2 for uncacheable attributes (i.e. attributes that
+ * follow a variable-length attribute).
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber i, j;
+
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ {
+ /*
+ * Already done the calculations, find the last attribute that has
+ * cache offset.
+ */
+ for (i = (AttrNumber) numberOfAttributes; i > 1; i--)
+ {
+ if (TupleDescAttr(desc, i - 1)->attcacheoff != -2)
+ return i;
+ }
+
+ return 1;
+ }
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ i = 1;
+ /*
+ * Someone might have set some offsets previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (i < numberOfAttributes && TupleDescAttr(desc, i)->attcacheoff > 0)
+ i++;
+
+ /* Cache offset is undetermined. Start calculating offsets if possible */
+ if (i < numberOfAttributes &&
+ TupleDescAttr(desc, i)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, i - 1);
+ Size off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (i < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, i);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ i++;
+ }
+ } else {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ }
+ }
+
+ /*
+ * No cacheable offsets left. Fill the rest with -2s, but return the latest
+ * cached offset.
+ */
+ j = i;
+
+ while (i < numberOfAttributes)
+ {
+ TupleDescAttr(desc, i)->attcacheoff = -2;
+ i++;
+ }
+
+ return j;
+}
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index 28dd6de18b..219f837875 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.30.2
v5-0008-Implement-dynamic-prefix-compression-in-nbtree.patch (application/x-patch)
From efcdd2b73405020b43d43c0e1770ba060baa126a Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 6 Jun 2022 23:16:18 +0200
Subject: [PATCH v5 8/8] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, if some prefix of the
key attributes in the tuples on both sides of the tuple being
compared is equal to the scankey, then the tuple being compared
must also share that prefix with the scankey.
We cannot propagate this information to _binsrch on lower pages,
as the downstream page may concurrently have split and/or have
merged with its deleted left neighbour (see [0]), which moves
the keyspace of the linked page. We can therefore only trust the
current state of the current page for this optimization, which
means we must re-establish this state each time we open a page.
Although this limits the overall applicability of the
optimization, it still allows for a nice performance improvement
in most cases where the initial columns have many duplicate
values and the compare function is not cheap.
Additionally, most of the time a page's high key is equal to the
right separator on the parent page. By storing this separator
and doing a binary equality check, we can cheaply validate the
high key of a page, which also allows us to carry the right
separator's prefix over into that page.
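Reduced to a standalone sketch with plain integer columns instead of index tuples (this is not the patch code, just the shape of the idea), the binary search tracks how many leading columns are already proven equal at each bound and lets the comparator start at the smaller of the two:

    #define NCOLS 2

    /* Compare 'key' against 'tuple', starting at 1-based column *cmpcol. */
    static int
    compare_from(const int key[NCOLS], const int tuple[NCOLS], int *cmpcol)
    {
        for (int col = *cmpcol; col <= NCOLS; col++)
        {
            if (key[col - 1] != tuple[col - 1])
            {
                *cmpcol = col;      /* first column that differs */
                return key[col - 1] < tuple[col - 1] ? -1 : 1;
            }
        }
        *cmpcol = NCOLS + 1;        /* all columns equal */
        return 0;
    }

    /* Return the offset of the first of 'ntuples' sorted tuples >= key. */
    static int
    binsrch_with_prefix(const int key[NCOLS],
                        const int tuples[][NCOLS], int ntuples)
    {
        int     low = 0,
                high = ntuples;
        int     lowcol = 1,     /* prefix proven equal at the low bound */
                highcol = 1;    /* prefix proven equal at the high bound */

        while (high > low)
        {
            int     mid = low + (high - low) / 2;
            /* columns before both proven prefixes must match at mid too */
            int     cmpcol = lowcol < highcol ? lowcol : highcol;
            int     result = compare_from(key, tuples[mid], &cmpcol);

            if (result > 0)
            {
                low = mid + 1;
                lowcol = cmpcol;
            }
            else
            {
                high = mid;
                highcol = cmpcol;
            }
        }
        return low;
    }

The same bookkeeping appears below as the lowcmpcol/highcmpcol pair in _bt_binsrch and _bt_binsrch_insert.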
---
contrib/amcheck/verify_nbtree.c | 17 +--
src/backend/access/nbtree/README | 25 +++++
src/backend/access/nbtree/nbtinsert.c | 14 ++-
src/backend/access/nbtree/nbtinsert_spec.h | 22 ++--
src/backend/access/nbtree/nbtsearch.c | 3 +-
src/backend/access/nbtree/nbtsearch_spec.h | 115 ++++++++++++++++++---
src/include/access/nbtree_specialized.h | 9 +-
7 files changed, 169 insertions(+), 36 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 2beeebb163..8c4215372a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2700,6 +2700,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2709,13 +2710,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2777,6 +2778,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2787,7 +2789,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2839,10 +2841,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2862,10 +2865,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2900,13 +2904,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3c08888c23..13ac9ee2be 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,31 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because nbtree indexes have a sorted keyspace, once we have determined that
+some prefix of key columns in the tuples on both sides of the tuple being
+compared is equal to the scankey, the tuple being compared must share that
+prefix with the scankey as well. This allows us to skip comparing those
+columns, potentially saving cycles.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so this is only useful on the page level: Concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by an inner page to the right. If we re-used high- and low-column-prefixes,
+we would not be able to detect a change of keyspace from e.g. (2,2) to (1,2),
+and subsequently return invalid results. This race condition can only be
+prevented by re-establishing the prefix-equal-columns for each page.
+
+The upside is that we already have a comparison result for the highest
+value on the page: a page's high key is compared to the scankey while we
+hold a pin on the page, in the _bt_moveright procedure. The _bt_binsrch
+procedure uses this result as the prefix bound for the high end of the
+page, and each step of the binary search then further improves the
+equal-prefix bounds.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ec6c73d1cc..20e5f33f98 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -132,7 +132,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ offset = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -142,6 +142,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -177,7 +179,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(nbts_call(_bt_compare, rel, itup_key, page, offset) < 0);
+ Assert(nbts_call(_bt_compare, rel, itup_key, page, offset,
+ &cmpcol) < 0);
break;
}
@@ -202,7 +205,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (nbts_call(_bt_compare, rel, itup_key, page, offset) != 0)
+ if (nbts_call(_bt_compare, rel, itup_key, page, offset,
+ &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -412,11 +416,13 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY);
+ highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY,
+ &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
diff --git a/src/backend/access/nbtree/nbtinsert_spec.h b/src/backend/access/nbtree/nbtinsert_spec.h
index 97c866aea3..ccba0fa5ed 100644
--- a/src/backend/access/nbtree/nbtinsert_spec.h
+++ b/src/backend/access/nbtree/nbtinsert_spec.h
@@ -73,6 +73,7 @@ NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber comparecol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -91,7 +92,8 @@ NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- nbts_call(_bt_compare, rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ nbts_call(_bt_compare, rel, insertstate->itup_key, page,
+ P_HIKEY, &comparecol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -221,6 +223,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
/*
* Does the new tuple belong on this page?
*
@@ -238,7 +241,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0)
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, insertstate, stack);
@@ -291,6 +294,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -321,7 +325,8 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) != 0 ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY,
+ &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -336,10 +341,13 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) || nbts_call(_bt_compare, rel, itup_key,
+ page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -358,7 +366,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d5152bfcb7..607940bbcd 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -178,6 +178,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ AttrNumber attno = 1;
Assert(!BTScanPosIsValid(so->currPos));
@@ -696,7 +697,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = nbts_call(_bt_binsrch, rel, &inskey, buf);
+ offnum = nbts_call(_bt_binsrch, rel, &inskey, buf, &attno);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
index a5c5f2b94f..19a6178334 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.h
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -10,8 +10,10 @@
*/
#ifndef NBTS_SPECIALIZING_DEFAULT
-static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel, BTScanInsert key,
- Buffer buf);
+static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ AttrNumber *highkeycmpcol);
static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
@@ -38,7 +40,8 @@ static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
static OffsetNumber
NBTS_FUNCTION(_bt_binsrch)(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -46,6 +49,8 @@ NBTS_FUNCTION(_bt_binsrch)(Relation rel,
high;
int32 result,
cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -87,17 +92,26 @@ NBTS_FUNCTION(_bt_binsrch)(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = nbts_call(_bt_compare, rel, key, page, mid);
+ result = nbts_call(_bt_compare, rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
+ *highkeycmpcol = highcmpcol;
+
/*
* At this point we have high == low, but be careful: they could point
* past the last slot on the page.
@@ -423,6 +437,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
/* Get the root page to start with */
*bufP = _bt_getroot(rel, access);
@@ -441,6 +456,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
IndexTuple itup;
BlockNumber child;
BTStack new_stack;
+ AttrNumber highkeycmpcol = 1;
/*
* Race -- the page we just grabbed may have split since we read its
@@ -456,7 +472,8 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
*/
*bufP = nbts_call(_bt_moveright, rel, key, *bufP,
(access == BT_WRITE), stack_in,
- page_access, snapshot);
+ page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -468,12 +485,17 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = nbts_call(_bt_binsrch, rel, key, *bufP);
+ offnum = nbts_call(_bt_binsrch, rel, key, *bufP, &highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
child = BTreeTupleGetDownLink(itup);
+ if (highkeycmpcol > 1)
+ {
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+ }
+
/*
* We need to save the location of the pivot tuple we chose in a new
* stack entry for this page/level. If caller ends up splitting a
@@ -507,6 +529,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ AttrNumber highkeycmpcol = 1;
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -517,7 +540,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
* move right to its new sibling. Do that.
*/
*bufP = nbts_call(_bt_moveright, rel, key, *bufP, true, stack_in,
- BT_WRITE, snapshot);
+ BT_WRITE, snapshot, &highkeycmpcol, (char *) tupdatabuf);
}
return stack_in;
@@ -565,12 +588,16 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
bool forupdate,
BTStack stack,
int access,
- Snapshot snapshot)
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
+ Assert(PointerIsValid(comparecol));
+
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -592,12 +619,17 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = cmpcol;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -623,14 +655,49 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY) >= cmpval)
+ /*
+ * When comparecol is > 1, tupdatabuf is filled with the right separator
+ * of the parent node. This allows us to do a binary equality check
+ * between the parent node's right separator (which is < key) and this
+ * page's P_HIKEY. If they are equal, we can reuse the result of the
+ * parent node's rightkey compare, which means we can potentially save
+ * a full key compare.
+ *
+ * Without this, we'd on average use 3 full key compares per page before
+ * we achieve full dynamic prefix bounds; with this optimization we can
+ * skip one of those.
+ *
+ * 3: 1 for the highkey (rightmost), and on average 2 before we move
+ * right in the binary search on the page.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ }
+ }
+
+ if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY,
+ &cmpcol) >= cmpval)
{
/* step right one page */
+ *comparecol = 1;
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -663,7 +730,8 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
* list split).
*/
OffsetNumber
-NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -673,6 +741,7 @@ NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -723,16 +792,21 @@ NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = nbts_call(_bt_compare, rel, key, page, mid);
+ result = nbts_call(_bt_compare, rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
if (result != 0)
stricthigh = high;
}
@@ -813,7 +887,8 @@ int32
NBTS_FUNCTION(_bt_compare)(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -854,10 +929,11 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- nbts_attiterinit(itup, 1, itupdesc);
- nbts_foreachattr(1, ncmpkey)
+ nbts_attiterinit(itup, *comparecol, itupdesc);
+ scankey = key->scankeys + ((*comparecol) - 1);
+
+ nbts_foreachattr(*comparecol, ncmpkey)
{
Datum datum;
@@ -902,11 +978,20 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = nbts_attiter_attnum;
return result;
+ }
scankey++;
}
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
diff --git a/src/include/access/nbtree_specialized.h b/src/include/access/nbtree_specialized.h
index c45fa84aed..ddceb4a4aa 100644
--- a/src/include/access/nbtree_specialized.h
+++ b/src/include/access/nbtree_specialized.h
@@ -43,12 +43,15 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key,
extern Buffer
NBTS_FUNCTION(_bt_moveright)(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access,
- Snapshot snapshot);
+ Snapshot snapshot, AttrNumber *comparecol,
+ char *tupdatabuf);
extern OffsetNumber
-NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate);
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
extern int32
NBTS_FUNCTION(_bt_compare)(Relation rel, BTScanInsert key,
- Page page, OffsetNumber offnum);
+ Page page, OffsetNumber offnum,
+ AttrNumber *comparecol);
/*
* prototypes for functions in nbtutils_spec.h
--
2.30.2
v5-0007-Add-specialization-to-btree-index-creation.patch (application/x-patch)
From 3a3b346c80bfafb094f296c2ab15c86063deb2b5 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 21 Apr 2022 16:22:07 +0200
Subject: [PATCH v5 7/8] Add specialization to btree index creation.
This was an easily corrected oversight, but an oversight nonetheless.
It increases the (re)build performance of indexes by another few percent.
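With the attribute-iteration macros introduced earlier in this series, the extra-keys loop of the moved comparator would presumably end up looking roughly like the sketch below; this is illustrative only (not the actual tuplesort_nbts.h contents) and relies on the local variables of comparetup_index_btree as shown in the removed hunk further down:

    nbts_attiterdeclare(tuple1);
    nbts_attiterdeclare(tuple2);

    nbts_attiterinit(tuple1, 2, tupDes);
    nbts_attiterinit(tuple2, 2, tupDes);

    nbts_foreachattr(2, keysz)
    {
        datum1 = nbts_attiter_nextattdatum(tuple1, tupDes);
        datum2 = nbts_attiter_nextattdatum(tuple2, tupDes);

        compare = ApplySortComparator(datum1, nbts_attiter_curattisnull(tuple1),
                                      datum2, nbts_attiter_curattisnull(tuple2),
                                      sortKey);
        if (compare != 0)
            return compare;

        /* they are equal, so we only need to examine one null flag */
        if (nbts_attiter_curattisnull(tuple1))
            equal_hasnull = true;

        sortKey++;
    }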
---
src/backend/utils/sort/tuplesort.c | 147 ++---------------------
src/backend/utils/sort/tuplesort_nbts.h | 148 ++++++++++++++++++++++++
src/include/access/nbtree.h | 18 +++
3 files changed, 175 insertions(+), 138 deletions(-)
create mode 100644 src/backend/utils/sort/tuplesort_nbts.h
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7f398bd4eb..3b96e8d4f8 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -655,8 +655,6 @@ static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
SortTuple *stup);
static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
LogicalTape *tape, unsigned int len);
-static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -679,6 +677,10 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
static void tuplesort_free(Tuplesortstate *state);
static void tuplesort_updatemax(Tuplesortstate *state);
+#define NBT_SPECIALIZE_FILE "../../backend/utils/sort/tuplesort_nbts.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
/*
* Specialized comparators that we can inline into specialized sorts. The goal
* is to try to sort two tuples without having to follow the pointers to the
@@ -1239,7 +1241,7 @@ tuplesort_begin_index_btree(Relation heapRel,
sortopt & TUPLESORT_RANDOMACCESS,
PARALLEL_SORT(state));
- state->comparetup = comparetup_index_btree;
+ state->comparetup = NBT_SPECIALIZE_NAME(comparetup_index_btree, indexRel);
state->copytup = copytup_index;
state->writetup = writetup_index;
state->readtup = readtup_index;
@@ -1357,7 +1359,7 @@ tuplesort_begin_index_gist(Relation heapRel,
state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
- state->comparetup = comparetup_index_btree;
+ state->comparetup = NBT_SPECIALIZE_NAME(comparetup_index_btree, indexRel);
state->copytup = copytup_index;
state->writetup = writetup_index;
state->readtup = readtup_index;
@@ -4321,142 +4323,11 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
* The btree and hash cases require separate comparison functions, but the
* IndexTuple representation is the same so the copy/write/read support
* functions can be shared.
+ *
+ * The nbtree comparator can be found in tuplesort_nbts.h, and is included
+ * through the nbtree specialization headers.
*/
-static int
-comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- /*
- * This is similar to comparetup_heap(), but expects index tuples. There
- * is also special handling for enforcing uniqueness, and special
- * treatment for equal keys at the end.
- */
- SortSupport sortKey = state->sortKeys;
- IndexTuple tuple1;
- IndexTuple tuple2;
- int keysz;
- TupleDesc tupDes;
- bool equal_hasnull = false;
- int nkey;
- int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
-
- /* Compare the leading sort key */
- compare = ApplySortComparator(a->datum1, a->isnull1,
- b->datum1, b->isnull1,
- sortKey);
- if (compare != 0)
- return compare;
-
- /* Compare additional sort keys */
- tuple1 = (IndexTuple) a->tuple;
- tuple2 = (IndexTuple) b->tuple;
- keysz = state->nKeys;
- tupDes = RelationGetDescr(state->indexRel);
-
- if (sortKey->abbrev_converter)
- {
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
- }
-
- /* they are equal, so we only need to examine one null flag */
- if (a->isnull1)
- equal_hasnull = true;
-
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
- {
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
-
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare; /* done when we find unequal attributes */
-
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
- equal_hasnull = true;
- }
-
- /*
- * If btree has asked us to enforce uniqueness, complain if two equal
- * tuples are detected (unless there was at least one NULL field and NULLS
- * NOT DISTINCT was not set).
- *
- * It is sufficient to make the test here, because if two tuples are equal
- * they *must* get compared at some stage of the sort --- otherwise the
- * sort algorithm wouldn't have checked whether one must appear before the
- * other.
- */
- if (state->enforceUnique && !(!state->uniqueNullsNotDistinct && equal_hasnull))
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- /*
- * Some rather brain-dead implementations of qsort (such as the one in
- * QNX 4) will sometimes call the comparison routine to compare a
- * value to itself, but we always use our own implementation, which
- * does not.
- */
- Assert(tuple1 != tuple2);
-
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(state->indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(state->indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(state->heapRel,
- RelationGetRelationName(state->indexRel))));
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is required for
- * btree indexes, since heap TID is treated as an implicit last key
- * attribute in order to ensure that all keys in the index are physically
- * unique.
- */
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
- /* ItemPointer values should never be equal */
- Assert(false);
-
- return 0;
-}
-
static int
comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state)
diff --git a/src/backend/utils/sort/tuplesort_nbts.h b/src/backend/utils/sort/tuplesort_nbts.h
new file mode 100644
index 0000000000..d1b2670747
--- /dev/null
+++ b/src/backend/utils/sort/tuplesort_nbts.h
@@ -0,0 +1,148 @@
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static int NBTS_FUNCTION(comparetup_index_btree)(const SortTuple *a,
+ const SortTuple *b,
+ Tuplesortstate *state);
+
+static int
+NBTS_FUNCTION(comparetup_index_btree)(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ /*
+ * This is similar to comparetup_heap(), but expects index tuples. There
+ * is also special handling for enforcing uniqueness, and special
+ * treatment for equal keys at the end.
+ */
+ SortSupport sortKey = state->sortKeys;
+ IndexTuple tuple1;
+ IndexTuple tuple2;
+ int keysz;
+ TupleDesc tupDes;
+ bool equal_hasnull = false;
+ int nkey;
+ int32 compare;
+ nbts_attiterdeclare(tuple1);
+ nbts_attiterdeclare(tuple2);
+
+ /* Compare the leading sort key */
+ compare = ApplySortComparator(a->datum1, a->isnull1,
+ b->datum1, b->isnull1,
+ sortKey);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ tuple1 = (IndexTuple) a->tuple;
+ tuple2 = (IndexTuple) b->tuple;
+ keysz = state->nKeys;
+ tupDes = RelationGetDescr(state->indexRel);
+
+ if (!sortKey->abbrev_converter)
+ {
+ nkey = 2;
+ sortKey++;
+ }
+ else
+ nkey = 1;
+
+ if (a->isnull1)
+ equal_hasnull = true;
+
+ nbts_attiterinit(tuple1, nkey, tupDes);
+ nbts_attiterinit(tuple2, nkey, tupDes);
+
+ nbts_foreachattr(nkey, keysz)
+ {
+ Datum datum1,
+ datum2;
+ datum1 = nbts_attiter_nextattdatum(tuple1, tupDes);
+ datum2 = nbts_attiter_nextattdatum(tuple2, tupDes);
+
+ if (nbts_attiter_attnum == 1)
+ {
+ compare = ApplySortAbbrevFullComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ }
+ else
+ {
+ compare = ApplySortComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ }
+
+ if (compare != 0)
+ return compare;
+
+ if (nbts_attiter_curattisnull(tuple1))
+ equal_hasnull = true;
+
+ sortKey++;
+ }
+
+ /*
+ * If btree has asked us to enforce uniqueness, complain if two equal
+ * tuples are detected (unless there was at least one NULL field and NULLS
+ * NOT DISTINCT was not set).
+ *
+ * It is sufficient to make the test here, because if two tuples are equal
+ * they *must* get compared at some stage of the sort --- otherwise the
+ * sort algorithm wouldn't have checked whether one must appear before the
+ * other.
+ */
+ if (state->enforceUnique && !(!state->uniqueNullsNotDistinct && equal_hasnull))
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ /*
+ * Some rather brain-dead implementations of qsort (such as the one in
+ * QNX 4) will sometimes call the comparison routine to compare a
+ * value to itself, but we always use our own implementation, which
+ * does not.
+ */
+ Assert(tuple1 != tuple2);
+
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(state->indexRel, values, isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(state->indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(state->heapRel,
+ RelationGetRelationName(state->indexRel))));
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is required for
+ * btree indexes, since heap TID is treated as an implicit last key
+ * attribute in order to ensure that all keys in the index are physically
+ * unique.
+ */
+ {
+ BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
+ BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
+
+ if (blk1 != blk2)
+ return (blk1 < blk2) ? -1 : 1;
+ }
+ {
+ OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
+ OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
+
+ if (pos1 != pos2)
+ return (pos1 < pos2) ? -1 : 1;
+ }
+
+ /* ItemPointer values should never be equal */
+ Assert(false);
+
+ return 0;
+}
+
+#endif
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 92894e4ea7..11116b47ca 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1170,6 +1170,24 @@ do { \
) \
)
+#define NBT_SPECIALIZE_NAME(name, rel) \
+( \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_COLUMN) \
+ ) \
+ : \
+ ( \
+ TupleDescAttr(RelationGetDescr(rel), \
+ IndexRelationGetNumberOfKeyAttributes(rel) - 1)->attcacheoff > 0 ? ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_UNCACHED) \
+ ) \
+ ) \
+)
+
#else /* not defined NBTS_ENABLED */
#define nbt_opt_specialize(rel)
--
2.30.2
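For readability, the three-way dispatch performed by the NBT_SPECIALIZE_NAME
macro above (as used by tuplesort_begin_index_btree) can be sketched as a
plain function. This is an illustration only, not code from the patch: the
SortTupleComparator typedef mirrors the one in tuplesort.c, and the
*_single/_cached/_uncached names stand in for the NBTS_MAKE_NAME()-generated
specializations.

/*
 * Illustration only: roughly what
 * NBT_SPECIALIZE_NAME(comparetup_index_btree, indexRel) resolves to.
 */
typedef int (*SortTupleComparator) (const SortTuple *a, const SortTuple *b,
                                    Tuplesortstate *state);

static SortTupleComparator
choose_comparetup_index_btree(Relation indexRel)
{
    int         nkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);

    if (nkeyatts == 1)
        return comparetup_index_btree_single;   /* single key column */

    /* is the offset of the last key column cacheable? */
    if (TupleDescAttr(RelationGetDescr(indexRel),
                      nkeyatts - 1)->attcacheoff > 0)
        return comparetup_index_btree_cached;

    return comparetup_index_btree_uncached;
}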
Attachment: v5-0006-Implement-specialized-uncacheable-attribute-itera.patch (application/x-patch)
From 3e558e97598466b79f57c530ff5e56ac9cc20ad4 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:44:01 +0200
Subject: [PATCH v5 6/8] Implement specialized uncacheable attribute iteration
Uses an iterator to prevent doing duplicate work while iterating over
attributes.
Inspiration: https://www.postgresql.org/message-id/CAEze2WjE9ka8i%3Ds-Vv5oShro9xTrt5VQnQvFG9AaRwWpMm3-fg%40mail.gmail.com
---
src/backend/access/nbtree/nbtree_spec.h | 1 +
src/include/access/itup_attiter.h | 198 ++++++++++++++++++++++++
src/include/access/nbtree.h | 13 +-
src/include/access/nbtree_specialize.h | 40 ++++-
4 files changed, 249 insertions(+), 3 deletions(-)
create mode 100644 src/include/access/itup_attiter.h
diff --git a/src/backend/access/nbtree/nbtree_spec.h b/src/backend/access/nbtree/nbtree_spec.h
index 4c342287f6..88b01c86f7 100644
--- a/src/backend/access/nbtree/nbtree_spec.h
+++ b/src/backend/access/nbtree/nbtree_spec.h
@@ -9,6 +9,7 @@ void
NBTS_FUNCTION(_bt_specialize)(Relation rel)
{
#ifdef NBTS_SPECIALIZING_DEFAULT
+ PopulateTupleDescCacheOffsets(rel->rd_att);
nbts_call_norel(_bt_specialize, rel, rel);
#else
rel->rd_indam->aminsert = NBTS_FUNCTION(btinsert);
diff --git a/src/include/access/itup_attiter.h b/src/include/access/itup_attiter.h
new file mode 100644
index 0000000000..9f16a4b3d7
--- /dev/null
+++ b/src/include/access/itup_attiter.h
@@ -0,0 +1,198 @@
+/*-------------------------------------------------------------------------
+ *
+ * itup_attiter.h
+ * POSTGRES index tuple attribute iterator definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/itup_attiter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ITUP_ATTITER_H
+#define ITUP_ATTITER_H
+
+#include "access/itup.h"
+
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupdesc - the tuple description
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* Offset of attribute 1 is always 0 */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ false, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false);
+
+/*
+ * Initialize an index attribute iterator for a walk that continues at
+ * attribute attnum.
+ *
+ * This is nearly the same as index_deform_tuple, except that it computes the
+ * internal iterator state up to attnum instead of populating the datum- and
+ * isnull-arrays.
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+/* ----------------
+ * index_attiternext() - get the next attribute of an index tuple
+ *
+ * This gets called many times, so we do the least amount of work
+ * possible.
+ *
+ * The code does not attempt to update attcacheoff, as it is unlikely
+ * to reach a situation where the cached offset matters much.
+ * If the cached offsets do matter, the caller should make sure that
+ * PopulateTupleDescCacheOffsets() was called on the tuple descriptor
+ * to populate the attribute offset cache.
+ *
+ * ----------------
+ */
+static inline Datum
+index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true;
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
+#endif /* ITUP_ATTITER_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 1559399b0e..92894e4ea7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -16,6 +16,7 @@
#include "access/amapi.h"
#include "access/itup.h"
+#include "access/itup_attiter.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/xlogreader.h"
@@ -1122,6 +1123,7 @@ typedef struct BTOptions
*/
#define NBTS_TYPE_SINGLE_COLUMN single
#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_DEFAULT default
@@ -1152,12 +1154,19 @@ do { \
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
- IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
) \
: \
( \
- NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ TupleDescAttr(RelationGetDescr(rel), \
+ IndexRelationGetNumberOfKeyAttributes(rel) - 1)->attcacheoff > 0 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_UNCACHED)(__VA_ARGS__) \
+ ) \
) \
)
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index 9733a27bdd..efbacf7d67 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -115,7 +115,11 @@
#define nbts_attiter_nextattdatum(itup, tupDesc) \
( \
AssertMacro(spec_i == 0), \
- (IndexTupleHasNulls(itup) && att_isnull(0, (char *)(itup) + sizeof(IndexTupleData))) ? \
+ ( \
+ IndexTupleHasNulls(itup) && \
+ att_isnull(0, (bits8 *) ((char *) (itup) + sizeof(IndexTupleData))) \
+ ) \
+ ? \
( \
(NBTS_MAKE_NAME(itup, isNull)) = true, \
(Datum)NULL \
@@ -175,6 +179,40 @@
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but attcacheoff -optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit((itup), (initAttNum), (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/* reset call to SPECIALIZE_CALL for default behaviour */
#undef nbts_call_norel
#define nbts_call_norel(name, rel, ...) \
--
2.30.2
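To make the iterator API introduced above easier to follow, here is a minimal
usage sketch (not part of the patch). It assumes an IndexTuple itup, its
TupleDesc tupdesc, and a key column count nkeyatts are in scope:

/*
 * Usage sketch (illustration only): walk the key columns of an index tuple
 * left to right, fetching each datum exactly once without re-deriving the
 * offsets of earlier attributes.
 */
IAttrIterStateData iter;

index_attiterinit(itup, 1, tupdesc, &iter);
for (AttrNumber attnum = 1; attnum <= nkeyatts; attnum++)
{
    Datum       datum = index_attiternext(itup, attnum, tupdesc, &iter);

    if (iter.isNull)
        continue;               /* NULL key attribute */

    /* ... compare or otherwise use datum here ... */
}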
On Wed, 27 Jul 2022 at 09:35, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Mon, 4 Jul 2022 at 16:18, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > On Sun, 5 Jun 2022 at 21:12, Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > While working on benchmarking the v2 patchset, I noticed no
> > > improvement on reindex, which I attributed to forgetting to also
> > > specialize comparetup_index_btree in tuplesort.c. After adding the
> > > specialization there as well (attached in v3), reindex performance
> > > improved significantly too.
> >
> > PFA version 4 of this patchset. Changes:
>
> Version 5 now, which is identical to v4 except for bitrot fixes to
> deal with f58d7073.

... and now v6 to deal with d0b193c0 and co.

I probably should've waited a bit longer this morning and checked
master before sending, but that's not how it went. Sorry for the
noise.
Kind regards,
Matthias van de Meent
Attachments:
Attachment: v6-0005-Add-a-function-whose-task-it-is-to-populate-all-a.patch (application/octet-stream)
From 9b51e105074a776d4bb69afd77a0223e719a4997 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:51:01 +0200
Subject: [PATCH v6 5/8] Add a function whose task it is to populate all
attcacheoff-s of a TupleDesc's attributes
It fills uncacheable offsets with -2, as opposed to -1 which signals
"unknown", allowing users of the API to determine the cacheability of an
attribute in O(1) after this one-time O(n) cost, instead of the repeated
O(n) cost that currently applies.
---
src/backend/access/common/tupdesc.c | 97 +++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 99 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index d6fb261e20..af4ef00f2b 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -919,3 +919,100 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc, returning the number of
+ * the last attribute with a validly cached offset.
+ *
+ * Sets attcacheoff to -2 for uncacheable attributes (i.e. attributes that
+ * follow a variable-length attribute).
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber i, j;
+
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ {
+ /*
+ * Already done the calculations, find the last attribute that has
+ * cache offset.
+ */
+ for (i = (AttrNumber) numberOfAttributes; i > 1; i--)
+ {
+ if (TupleDescAttr(desc, i - 1)->attcacheoff != -2)
+ return i;
+ }
+
+ return 1;
+ }
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ i = 1;
+ /*
+ * Someone might have set some offsets previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (i < numberOfAttributes && TupleDescAttr(desc, i)->attcacheoff > 0)
+ i++;
+
+ /* Cache offset is undetermined. Start calculating offsets if possible */
+ if (i < numberOfAttributes &&
+ TupleDescAttr(desc, i)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, i - 1);
+ Size off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (i < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, i);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ i++;
+ }
+ } else {
+ if (off == att_align_nominal(off, att->attalign))
+ att->attcacheoff = off;
+ else
+ att->attcacheoff = -2;
+ i++;
+ }
+ }
+
+ /*
+ * No cacheable offsets left. Fill the rest with -2s, but return the latest
+ * cached offset.
+ */
+ j = i;
+
+ while (i < numberOfAttributes)
+ {
+ TupleDescAttr(desc, i)->attcacheoff = -2;
+ i++;
+ }
+
+ return j;
+}
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index 28dd6de18b..219f837875 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.30.2
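As a quick illustration of the -1/-2 convention described in the commit
message above (sketch only; tupdesc and a 1-based attnum are assumed to be in
scope):

/*
 * Illustration only: after PopulateTupleDescCacheOffsets() every attribute
 * has attcacheoff either >= 0 (known fixed offset) or -2 (provably not
 * cacheable), so a single comparison answers "can this attribute be fetched
 * at a fixed offset?".
 */
PopulateTupleDescCacheOffsets(tupdesc);     /* one-time O(n) pass */

if (TupleDescAttr(tupdesc, attnum - 1)->attcacheoff >= 0)
{
    /* offset is cached: the attribute starts at a known byte offset */
}
else
{
    Assert(TupleDescAttr(tupdesc, attnum - 1)->attcacheoff == -2);
    /* fall back to walking the preceding attributes (or use the iterator) */
}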
Attachment: v6-0004-Optimize-attribute-iterator-access-for-single-col.patch (application/octet-stream)
From 1609509386ef18d5a22ebc79c9075e34573f0640 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:47:50 +0200
Subject: [PATCH v6 4/8] Optimize attribute iterator access for single-column
btree keys
This removes the index_getattr_nocache call path, which has significant overhead.
---
src/include/access/nbtree.h | 9 +++-
src/include/access/nbtree_specialize.h | 63 ++++++++++++++++++++++++++
2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 489b623663..1559399b0e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1120,6 +1120,7 @@ typedef struct BTOptions
/*
* Macros used in the nbtree specialization code.
*/
+#define NBTS_TYPE_SINGLE_COLUMN single
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
@@ -1151,7 +1152,13 @@ do { \
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
- NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
)
#else /* not defined NBTS_ENABLED */
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index 23fdda4f0e..9733a27bdd 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -79,6 +79,69 @@
#define nbts_call_norel(name, rel, ...) \
(NBTS_FUNCTION(name)(__VA_ARGS__))
+/*
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path may never be used for indexes with multiple key
+ * columns, because it does not ever continue to a next column.
+ */
+
+#define NBTS_SPECIALIZING_SINGLE_COLUMN
+#define NBTS_TYPE NBTS_TYPE_SINGLE_COLUMN
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+/*
+ * We cast endAttNum to void to prevent unused-variable warnings.
+ * The if- and for-loop are structured like this so that the compiler
+ * unrolls the loop and detects that there is only a single iteration.
+ * We need `break` to work inside the following code block, so a plain
+ * 'if' statement would not suffice.
+ */
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert((endAttNum) == 1); ((void) (endAttNum)); \
+ if ((initAttNum) == 1) for (int spec_i = 0; spec_i < 1; spec_i++)
+
+#define nbts_attiter_attnum 1
+
+/*
+ * Simplified (optimized) variant of index_getattr specialized for extracting
+ * only the first attribute: cache offset is guaranteed to be 0, and as such
+ * no cache is required.
+ */
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+( \
+ AssertMacro(spec_i == 0), \
+ (IndexTupleHasNulls(itup) && att_isnull(0, (char *)(itup) + sizeof(IndexTupleData))) ? \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull)) = true, \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ (NBTS_MAKE_NAME(itup, isNull) = false), \
+ (Datum) fetchatt(TupleDescAttr((tupDesc), 0), \
+ (char *) (itup) + IndexInfoFindDataOffset((itup)->t_info)) \
+ ) \
+)
+
+#define nbts_attiter_curattisnull(tuple) \
+ NBTS_MAKE_NAME(tuple, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_COLUMN
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* Multiple key columns, optimized access for attcacheoff -cacheable offsets.
*/
--
2.30.2
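The single-column fast path above is defined entirely in macros; written out
as a plain function it amounts to roughly the following (a sketch, not code
from the patch; the name first_key_getattr is made up for illustration):

/*
 * Sketch of the single-column fast path: the first attribute always starts
 * at data offset 0, so no attcacheoff lookup or offset arithmetic is needed.
 */
static inline Datum
first_key_getattr(IndexTuple itup, TupleDesc tupdesc, bool *isnull)
{
    if (IndexTupleHasNulls(itup) &&
        att_isnull(0, (bits8 *) ((char *) itup + sizeof(IndexTupleData))))
    {
        *isnull = true;
        return (Datum) 0;
    }

    *isnull = false;
    return fetchatt(TupleDescAttr(tupdesc, 0),
                    (char *) itup + IndexInfoFindDataOffset(itup->t_info));
}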
Attachment: v6-0003-Specialize-the-nbtree-rd_indam-entry.patch (application/octet-stream)
From 2e449385053315df3b89019ad8359656fa3cae97 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:54:52 +0200
Subject: [PATCH v6 3/8] Specialize the nbtree rd_indam entry.
Because each rd_indam struct is separately allocated for each index, we can
freely modify it at runtime without impacting other indexes of the same
access method. For btinsert (which effectively only calls _bt_doinsert) it is
useful to specialize that function, which also makes rd_indam->aminsert a
good signal for whether the indexRelation has been fully optimized yet.
---
src/backend/access/nbtree/nbtree.c | 7 +++++++
src/backend/access/nbtree/nbtsearch.c | 2 ++
src/backend/access/nbtree/nbtsort.c | 2 ++
src/include/access/nbtree.h | 14 ++++++++++++++
4 files changed, 25 insertions(+)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1481db4dcf..2ce996a0f5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -160,6 +160,8 @@ btbuildempty(Relation index)
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+ nbt_opt_specialize(index);
+
/*
* Write the page and log it. It might seem that an immediate sync would
* be sufficient to guarantee that the file exists on disk, but recovery
@@ -322,6 +324,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -764,6 +768,7 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(info->index);
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
@@ -797,6 +802,8 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (info->analyze_only)
return stats;
+ nbt_opt_specialize(info->index);
+
/*
* If btbulkdelete was called, we need not do anything (we just maintain
* the information used within _bt_vacuum_needs_cleanup() by calling
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e81eee9c35..d5152bfcb7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -181,6 +181,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(!BTScanPosIsValid(so->currPos));
+ nbt_opt_specialize(scan->indexRelation);
+
pgstat_count_index_scan(rel);
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 3558b2d3da..521a2a33c5 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -305,6 +305,8 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
BTBuildState buildstate;
double reltuples;
+ nbt_opt_specialize(index);
+
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
ResetUsage();
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0dbab16..489b623663 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1132,6 +1132,19 @@ typedef struct BTOptions
#ifdef NBTS_ENABLED
+/*
+ * Replace the functions in the rd_indam struct with a variant optimized for
+ * our key shape, if not already done.
+ *
+ * It only needs to be done once for every index relation loaded, so the
+ * branch is unlikely to be taken and is therefore marked unlikely().
+ */
+#define nbt_opt_specialize(rel) \
+do { \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert)) \
+ _bt_specialize(rel); \
+} while (false)
+
/*
* Access a specialized nbtree function, based on the shape of the index key.
*/
@@ -1143,6 +1156,7 @@ typedef struct BTOptions
#else /* not defined NBTS_ENABLED */
+#define nbt_opt_specialize(rel)
#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
#endif /* NBTS_ENABLED */
--
2.30.2
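Tying the pieces together, the runtime swap that nbt_opt_specialize() triggers
boils down to something like the following sketch. This is not code from the
patch: btinsert_spec is a hypothetical stand-in for the NBTS_FUNCTION(btinsert)
variant chosen for the relation's key shape.

/*
 * Sketch only: the one-time rd_indam swap.  Because rd_indam is allocated
 * per relation, overwriting aminsert here affects only this index.
 */
static void
bt_specialize_sketch(Relation rel)
{
    if (rel->rd_indam->aminsert != btinsert)
        return;                 /* already specialized */

    rel->rd_indam->aminsert = btinsert_spec;    /* hypothetical specialized name */
}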
Attachment: v6-0001-Specialize-nbtree-functions-on-btree-key-shape.patch (application/octet-stream)
From 825488ac1433612e4cbedc2479c7f0a8e1af295e Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 30 Jan 2022 16:23:31 +0100
Subject: [PATCH v6 1/8] Specialize nbtree functions on btree key shape
nbtree keys are not all made the same, so a significant amount of time is
spent on code that exists only to deal with other key shapes. By specializing
function calls based on the key shape, we can remove or reduce these causes
of overhead.
This commit adds the basic infrastructure for specializing specific hot code
in the nbtree AM to certain shapes of keys, initially splitting out
(not yet: specializing) the attcacheoff-capable case.
Note that we generate N specialized functions and 1 'default' function for each
specializable function.
This feature can be disabled by removing the #define NBTS_ENABLED -line in nbtree.h
---
src/backend/access/nbtree/README | 22 +
src/backend/access/nbtree/nbtdedup.c | 300 +------
src/backend/access/nbtree/nbtdedup_spec.h | 313 +++++++
src/backend/access/nbtree/nbtinsert.c | 572 +-----------
src/backend/access/nbtree/nbtinsert_spec.h | 569 ++++++++++++
src/backend/access/nbtree/nbtpage.c | 4 +-
src/backend/access/nbtree/nbtree.c | 31 +-
src/backend/access/nbtree/nbtree_spec.h | 50 ++
src/backend/access/nbtree/nbtsearch.c | 994 +--------------------
src/backend/access/nbtree/nbtsearch_spec.h | 994 +++++++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 271 +-----
src/backend/access/nbtree/nbtsort_spec.h | 275 ++++++
src/backend/access/nbtree/nbtsplitloc.c | 14 +-
src/backend/access/nbtree/nbtutils.c | 755 +---------------
src/backend/access/nbtree/nbtutils_spec.h | 772 ++++++++++++++++
src/include/access/nbtree.h | 61 +-
src/include/access/nbtree_specialize.h | 204 +++++
src/include/access/nbtree_specialized.h | 67 ++
18 files changed, 3355 insertions(+), 2913 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.h
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.h
create mode 100644 src/backend/access/nbtree/nbtree_spec.h
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.h
create mode 100644 src/backend/access/nbtree/nbtsort_spec.h
create mode 100644 src/backend/access/nbtree/nbtutils_spec.h
create mode 100644 src/include/access/nbtree_specialize.h
create mode 100644 src/include/access/nbtree_specialized.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 5529afc1fe..3c08888c23 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1041,6 +1041,28 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+
+Notes about nbtree call specialization
+--------------------------------------
+
+Attribute iteration is a significant overhead for multi-column indexes.
+We can avoid it by specializing performance-sensitive search functions
+and calling those selectively. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths. This performance benefit is at the cost
+of binary size, so this feature can be disabled by defining NBTS_DISABLED.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - single-column indexes
+ NB: The code paths of this optimization do not support multiple key columns.
+ - multi-column indexes that could benefit from the attcacheoff optimization
+ NB: This is also used for the default case, and is slow for uncacheable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 0207421a5d..d7025d8e1c 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,259 +22,16 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple, and actually update the page. Else
- * reset the state and move on without modifying the page.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
/*
* Perform bottom-up index deletion pass.
@@ -373,7 +130,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
- else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ else if (nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
@@ -748,55 +505,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
diff --git a/src/backend/access/nbtree/nbtdedup_spec.h b/src/backend/access/nbtree/nbtdedup_spec.h
new file mode 100644
index 0000000000..27e5a7e686
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.h
@@ -0,0 +1,313 @@
+/*
+ * Specialized functions included in nbtdedup.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static bool NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+NBTS_FUNCTION(_bt_do_singleval)(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (nbts_call(_bt_keep_natts_fast, rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
+
+#endif /* ifndef NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = nbts_call(_bt_do_singleval, rel, page, state,
+ minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ nbts_call(_bt_keep_natts_fast, rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and actually update the page. Else
+ * reset the state and move on without modifying the page.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
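
The only functional change in the deduplication code above is that direct
calls such as _bt_keep_natts_fast(rel, ...) become
nbts_call(_bt_keep_natts_fast, rel, ...). The nbts_call() macro is defined in
nbtree_specialize.h, which is not part of this hunk; the following is only a
rough, self-contained sketch of that style of key-shape dispatch, with
placeholder names and a placeholder selector rather than the patch's actual
macros:

#include <stdio.h>

enum demo_shape
{
	DEMO_SINGLE,				/* single key attribute */
	DEMO_MULTI					/* multiple key attributes */
};

static int demo_keep_natts_single(int natts) { return natts + 1; }
static int demo_keep_natts_multi(int natts)  { return natts + 2; }

/* Route the call to the variant matching the index's key shape. */
#define demo_call(name, shape, ...) \
	((shape) == DEMO_SINGLE ? name##_single(__VA_ARGS__) \
							: name##_multi(__VA_ARGS__))

int
main(void)
{
	printf("%d\n", demo_call(demo_keep_natts, DEMO_SINGLE, 10));	/* 11 */
	printf("%d\n", demo_call(demo_keep_natts, DEMO_MULTI, 10));		/* 12 */
	return 0;
}

The shape test is paid once per call site, while each specialized body can be
compiled with its own attribute-access strategy.
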
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f6f4af8bfe..ec6c73d1cc 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,18 +30,13 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
-static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_stepright(Relation rel,
+ BTInsertState insertstate,
+ BTStack stack);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -73,311 +68,10 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
- NULL);
-}
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -438,7 +132,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = nbts_call(_bt_binsrch_insert, rel, insertstate);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -483,7 +177,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(nbts_call(_bt_compare, rel, itup_key, page, offset) < 0);
break;
}
@@ -508,7 +202,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (nbts_call(_bt_compare, rel, itup_key, page, offset) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -722,7 +416,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -769,246 +463,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
-
- newitemoff = _bt_binsrch_insert(rel, insertstate);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1649,7 +1103,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lastleft = nposting;
}
- lefthighkey = _bt_truncate(rel, lastleft, firstright, itup_key);
+ lefthighkey = nbts_call(_bt_truncate, rel, lastleft, firstright, itup_key);
itemsz = IndexTupleSize(lefthighkey);
}
else
@@ -2764,8 +2218,8 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
/* Perform deduplication pass (when enabled and index-is-allequalimage) */
if (BTGetDeduplicateItems(rel) && itup_key->allequalimage)
- _bt_dedup_pass(rel, buffer, heapRel, insertstate->itup,
- insertstate->itemsz, (indexUnchanged || uniquedup));
+ nbts_call(_bt_dedup_pass, rel, buffer, heapRel, insertstate->itup,
+ insertstate->itemsz, (indexUnchanged || uniquedup));
}
/*
diff --git a/src/backend/access/nbtree/nbtinsert_spec.h b/src/backend/access/nbtree/nbtinsert_spec.h
new file mode 100644
index 0000000000..97c866aea3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.h
@@ -0,0 +1,569 @@
+/*
+ * Specialized functions for nbtinsert.c
+ */
+
+/*
+ * These functions are static and have no callers in the "default"
+ * specialization, so emitting default variants of them would only produce
+ * dead code and "defined but not used" compiler warnings.  Avoid both by
+ * not emitting these functions when the default variants are generated.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static BTStack NBTS_FUNCTION(_bt_search_insert)(Relation rel,
+ BTInsertState insertstate);
+
+static OffsetNumber NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ nbts_call(_bt_compare, rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return nbts_call(_bt_search, rel, insertstate->itup_key,
+ &insertstate->buf, BT_WRITE, NULL);
+}
+
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0)
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ Assert(P_RIGHTMOST(opaque) ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0);
+
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
+
+#endif /* ifndef NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = nbts_call(_bt_mkscankey, rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = nbts_call(_bt_search_insert, rel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = nbts_call(_bt_findinsertloc, rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+ itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
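
nbtinsert.c now pulls these functions back in through nbtree_specialize.h,
with NBT_SPECIALIZE_FILE naming the _spec.h file above and each function
declared through NBTS_FUNCTION(name). The specialize header itself is not
part of this hunk; presumably it includes the named file once per supported
key shape, with NBTS_FUNCTION() expanding to a differently suffixed symbol on
each pass. Below is a compressed, self-contained sketch of that generation
technique with placeholder names; it uses a body macro instead of
re-including a separate file so that it fits in one compilable unit:

#include <stdio.h>

#define PASTE_(a, b) a##b
#define PASTE(a, b)  PASTE_(a, b)

/* Template body, emitted once per specialization. */
#define EMIT_GETATTR(suffix, offset_expr) \
	static int PASTE(demo_getattr_, suffix)(const int *tuple, int n) \
	{ \
		return tuple[(offset_expr)]; \
	}

EMIT_GETATTR(cached, 0)			/* offset known up front */
EMIT_GETATTR(uncached, n - 1)	/* offset computed per request */

int
main(void)
{
	int			tuple[] = {7, 8, 9};

	printf("%d %d\n",
		   demo_getattr_cached(tuple, 3),
		   demo_getattr_uncached(tuple, 3));	/* prints "7 9" */
	return 0;
}

Each emitted copy carries the same logic but a different attribute-access
expression, which is what allows the compiler to optimize every
specialization independently.
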
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 8b96708b3e..70304c793d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1967,10 +1967,10 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, targetkey);
+ itup_key = nbts_call(_bt_mkscankey, rel, targetkey);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
- stack = _bt_search(rel, itup_key, &sleafbuf, BT_READ, NULL);
+ stack = nbts_call(_bt_search, rel, itup_key, &sleafbuf, BT_READ, NULL);
/* won't need a second lock or pin on leafbuf */
_bt_relbuf(rel, sleafbuf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b52eca8f38..1481db4dcf 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,10 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -177,33 +181,6 @@ btbuildempty(Relation index)
smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
diff --git a/src/backend/access/nbtree/nbtree_spec.h b/src/backend/access/nbtree/nbtree_spec.h
new file mode 100644
index 0000000000..4c342287f6
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.h
@@ -0,0 +1,50 @@
+/*
+ * Specialized functions for nbtree.c
+ */
+
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
+void
+NBTS_FUNCTION(_bt_specialize)(Relation rel)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbts_call_norel(_bt_specialize, rel, rel);
+#else
+ rel->rd_indam->aminsert = NBTS_FUNCTION(btinsert);
+#endif
+}
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ nbts_call_norel(_bt_specialize, rel, rel);
+
+ return nbts_call(btinsert, rel, values, isnull, ht_ctid, heapRel,
+ checkUnique, indexUnchanged, indexInfo);
+#else
+ bool result;
+ IndexTuple itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = nbts_call(_bt_doinsert, rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+#endif
+}
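
In nbtree_spec.h above, the "default" btinsert() behaves as a one-time
trampoline: it calls _bt_specialize(), which overwrites
rel->rd_indam->aminsert with the shape-specific btinsert() variant, and then
forwards the current insert through nbts_call(); later inserts on the same
relation reach the specialized function directly. A minimal, self-contained
sketch of that "specialize on first call" pattern follows; the struct and
names are placeholders, not PostgreSQL's IndexAmRoutine:

#include <stdio.h>

struct demo_am
{
	int			(*insert) (struct demo_am *am, int key);
};

/* Stand-in for a shape-specialized insert path. */
static int
demo_insert_fast(struct demo_am *am, int key)
{
	(void) am;
	return key * 2;
}

/* Default path: install the specialized variant, then forward the call. */
static int
demo_insert_default(struct demo_am *am, int key)
{
	am->insert = demo_insert_fast;
	return am->insert(am, key);
}

int
main(void)
{
	struct demo_am am = {demo_insert_default};

	printf("%d\n", am.insert(&am, 3));	/* installs fast path, returns 6 */
	printf("%d\n", am.insert(&am, 4));	/* goes straight to fast path, 8 */
	return 0;
}

The real code differs in that _bt_specialize() chooses among several
specialized variants rather than a single fast path, but the idea is the
same: the dispatch overhead is paid on the first insert rather than on every
insert.
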
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c74543bfde..e81eee9c35 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,11 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -46,6 +43,9 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* _bt_drop_lock_and_maybe_pin()
@@ -70,493 +70,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- */
-BTStack
-_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
- Snapshot snapshot)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- */
-Buffer
-_bt_moveright(Relation rel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- Snapshot snapshot)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- page = BufferGetPage(buf);
- TestForOldSnapshot(snapshot, rel, page);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- break;
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
- {
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- break;
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- high = mid;
- }
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid);
-
- if (result >= cmpval)
- low = mid + 1;
- else
- {
- high = mid;
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
/*----------
* _bt_binsrch_posting() -- posting list binary search.
@@ -625,217 +138,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- return result;
-
- scankey++;
- }
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
-
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -1363,7 +665,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
- stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
+ stack = nbts_call(_bt_search, rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
@@ -1392,7 +694,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = nbts_call(_bt_binsrch, rel, &inskey, buf);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1422,9 +724,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, offnum))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, offnum))
{
/*
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
@@ -1498,280 +800,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2014,7 +1042,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation,
+ scan, dir, P_FIRSTDATAKEY(opaque)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2116,7 +1145,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (nbts_call_norel(_bt_readpage, scan->indexRelation, scan,
+ dir, PageGetMaxOffsetNumber(page)))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2448,7 +1478,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start))
+ if (!nbts_call_norel(_bt_readpage, scan->indexRelation, scan, dir, start))
{
/*
* There's no actually-matching data on this page. Try to advance to
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
new file mode 100644
index 0000000000..73d5370496
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -0,0 +1,994 @@
+/*
+ * Specialized functions for nbtsearch.c
+ */
+
+/*
+ * These functions are not exposed, so their "default" emitted form would be
+ * unused and would generate warnings. Avoid unused code generation and the
+ * subsequent warnings by not emitting these functions when generating the
+ * code for defaults.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel, BTScanInsert key,
+ Buffer buf);
+static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
+ */
+static OffsetNumber
+NBTS_FUNCTION(_bt_binsrch)(Relation rel,
+ BTScanInsert key,
+ Buffer buf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
+
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* allow next page be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ nbts_call(_bt_checkkeys, scan->indexRelation, scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = nbts_call(_bt_checkkeys, scan->indexRelation,
+ scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+#endif /* NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ */
+BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
+ int access, Snapshot snapshot)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP,
+ (access == BT_WRITE), stack_in,
+ page_access, snapshot);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = nbts_call(_bt_binsrch, rel, key, *bufP);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is lowest non-leaf page level prior to leaves. So, if
+ * we're on the level 1 and asked to lock leaf page in write mode,
+ * then lock next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = nbts_call(_bt_moveright, rel, key, *bufP, true, stack_in,
+ BT_WRITE, snapshot);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split
+ * is completed. 'stack' is only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ */
+Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ Snapshot snapshot)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ break;
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY) >= cmpval)
+ {
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ break;
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
+
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = nbts_call(_bt_compare, rel, key, page, mid);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ {
+ high = mid;
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
+
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+NBTS_FUNCTION(_bt_compare)(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+ scankey = key->scankeys;
+ for (int i = 1; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ return result;
+
+ scankey++;
+ }
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
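
The per-attribute loop in _bt_compare() above is the kind of hot path the
specialization is aimed at: every iteration goes through index_getattr(),
whose cost depends on whether attribute offsets can be cached. Purely as an
illustration (the code and the helper name below are not taken from the
patch), a shape that only ever needs the first key attribute could fetch it
straight from the tuple's MAXALIGN'd data offset instead of going through
the generic attribute lookup:

#include "postgres.h"

#include "access/itup.h"
#include "access/tupmacs.h"

/*
 * Illustrative sketch only: fetch attribute 1 of an index tuple directly.
 * Attribute 1 always starts at the MAXALIGN'd data offset, so no offset
 * walk is needed; only the null bitmap has to be checked.  index_getattr()
 * reaches a comparable fast path when attcacheoff is set; this variant just
 * hard-codes the attribute-1 case.
 */
static inline Datum
example_index_getattr_first(IndexTuple itup, TupleDesc itupdesc, bool *isnull)
{
	Form_pg_attribute att = TupleDescAttr(itupdesc, 0);
	char	   *tp;

	if (IndexTupleHasNulls(itup) &&
		att_isnull(0, (bits8 *) itup + sizeof(IndexTupleData)))
	{
		*isnull = true;
		return (Datum) 0;
	}

	*isnull = false;
	tp = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
	return fetch_att(tp, att->attbyval, att->attlen);
}

In the loop above, a helper along these lines would stand in for the
index_getattr() call in the single-key-column shape, while the other shapes
keep accessors suited to them.
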
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index bd1685c441..3558b2d3da 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,9 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
/*
* btbuild() -- build a new btree index.
@@ -566,7 +567,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+ wstate.inskey = nbts_call(_bt_mkscankey, wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
@@ -578,7 +579,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
PROGRESS_BTREE_PHASE_LEAF_LOAD);
- _bt_load(&wstate, btspool, btspool2);
+ nbts_call_norel(_bt_load, wstate.index, &wstate, btspool, btspool2);
}
/*
@@ -978,8 +979,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
lastleft = (IndexTuple) PageGetItem(opage, ii);
Assert(IndexTupleSize(oitup) > last_truncextra);
- truncated = _bt_truncate(wstate->index, lastleft, oitup,
- wstate->inskey);
+ truncated = nbts_call(_bt_truncate, wstate->index, lastleft, oitup,
+ wstate->inskey);
if (!PageIndexTupleOverwrite(opage, P_HIKEY, (Item) truncated,
IndexTupleSize(truncated)))
elog(ERROR, "failed to add high key to the index page");
@@ -1176,264 +1177,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
new file mode 100644
index 0000000000..8f4a3602ca
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -0,0 +1,275 @@
+/*
+ * Specialized functions included in nbtsort.c
+ */
+
+/*
+ * These functions are not exposed outside the file that includes this
+ * template, so their "default" specialization would be emitted but never
+ * used, triggering unused-function warnings. Avoid that dead code (and the
+ * warnings) by not emitting these functions for the default variants.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static void NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2);
+
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
+ BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (nbts_call(_bt_keep_natts_fast, wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
+
+#endif
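
The "#define NBT_SPECIALIZE_FILE ..." / "#include "access/nbtree_specialize.h""
/ "#undef NBT_SPECIALIZE_FILE" blocks added to nbtsort.c and nbtutils.c drive
a textual template expansion: nbtree_specialize.h (not shown in these hunks)
re-includes the named *_spec.h file once per key shape, each time with
NBTS_FUNCTION() and the attribute accessors bound differently. A toy,
standalone version of that multiple-inclusion pattern looks like the code
below; all names there are made up, and none of it is the patch's actual
macro layer.

/* sum_template.h: the "spec" file, included several times, never compiled
 * on its own.  SPEC_FUNC() and SPEC_GETATTR() are bound by the includer. */
static int
SPEC_FUNC(sum_attrs)(const int *attrs, int natts)
{
	int		total = 0;

	for (int i = 0; i < natts; i++)
		total += SPEC_GETATTR(attrs, i);
	return total;
}

/* specialize.h: emits one copy of TEMPLATE_FILE per shape */
#define SPEC_PASTE_(a, b) a##b
#define SPEC_PASTE(a, b) SPEC_PASTE_(a, b)
#define SPEC_FUNC(name) SPEC_PASTE(name, SPEC_SUFFIX)

#define SPEC_SUFFIX _forward
#define SPEC_GETATTR(attrs, i) ((attrs)[i])
#include TEMPLATE_FILE
#undef SPEC_GETATTR
#undef SPEC_SUFFIX

#define SPEC_SUFFIX _reversed
/* this accessor refers to natts, which is in scope inside the template */
#define SPEC_GETATTR(attrs, i) ((attrs)[natts - 1 - (i)])
#include TEMPLATE_FILE
#undef SPEC_GETATTR
#undef SPEC_SUFFIX

/* caller.c: mirrors the NBT_SPECIALIZE_FILE pattern used in the patch */
#include <stdio.h>

#define TEMPLATE_FILE "sum_template.h"
#include "specialize.h"
#undef TEMPLATE_FILE

int
main(void)
{
	int			attrs[] = {1, 2, 3};

	printf("%d %d\n",
		   sum_attrs_forward(attrs, 3),
		   sum_attrs_reversed(attrs, 3));
	return 0;
}

The point is only that the compiler sees N independent copies of the same
function body, each free to inline its own accessor; which copy actually
gets called is decided elsewhere (see the nbts_call() call sites in the
hunks above and below).
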
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 241e26d338..8e5337cad7 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -692,7 +692,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
{
itemid = PageGetItemId(state->origpage, maxoff);
tup = (IndexTuple) PageGetItem(state->origpage, itemid);
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -723,7 +723,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = nbts_call(_bt_keep_natts_fast, state->rel, tup, state->newitem);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -972,7 +972,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* avoid appending a heap TID in new high key, we're done. Finish split
* with default strategy and initial split interval.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
return perfectpenalty;
@@ -993,7 +993,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* If page is entirely full of duplicates, a single value strategy split
* will be performed.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, leftmost, rightmost);
if (perfectpenalty <= indnkeyatts)
{
*strategy = SPLIT_MANY_DUPLICATES;
@@ -1031,8 +1031,8 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
itemid = PageGetItemId(state->origpage, P_HIKEY);
hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
- perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
- state->newitem);
+ perfectpenalty = nbts_call(_bt_keep_natts_fast, state->rel, hikey,
+ state->newitem);
if (perfectpenalty <= indnkeyatts)
*strategy = SPLIT_SINGLE_VALUE;
else
@@ -1154,7 +1154,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
lastleft = _bt_split_lastleft(state, split);
firstright = _bt_split_firstright(state, split);
- return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+ return nbts_call(_bt_keep_natts_fast, state->rel, lastleft, firstright);
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index ff260c393a..bc443ebd27 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,11 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1221,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2173,286 +1704,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
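
In the nbtsort.c and nbtsplitloc.c hunks above, direct calls such as
_bt_keep_natts_fast(state->rel, ...) become
nbts_call(_bt_keep_natts_fast, state->rel, ...). The real dispatch macro
lives in nbtree_specialize.h, which is not part of these hunks; the
following is only a sketch of the general idea, with invented names, and
assumes value-returning functions (a void function would need an if/else
form instead of the conditional expression):

#include "postgres.h"
#include "utils/rel.h"

/*
 * Invented example, not the patch's nbts_call(): choose a specialization at
 * the call site based on a cheap property of the index relation.  The
 * name##_single_key and name##_multi_key variants stand in for whatever
 * shapes the template actually emitted.  Note that rel is evaluated twice
 * here, which is acceptable for a sketch.
 */
#define example_nbts_call(name, rel, ...) \
	(IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? \
	 name##_single_key(rel, __VA_ARGS__) : \
	 name##_multi_key(rel, __VA_ARGS__))

/* e.g.: keep = example_nbts_call(_bt_keep_natts_fast, rel, lastleft, firstright); */
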
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
new file mode 100644
index 0000000000..a4b934ae7a
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -0,0 +1,772 @@
+/*
+ * Specialized functions included in nbtutils.c
+ */
+
+/*
+ * These functions are not exposed outside the file that includes this
+ * template, so their "default" specialization would be emitted but never
+ * used, triggering unused-function warnings. Avoid that dead code (and the
+ * warnings) by not emitting these functions for the default variants.
+ */
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static bool NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+
+static int NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+NBTS_FUNCTION(_bt_check_rowcompare)(ScanKey skey, IndexTuple tuple,
+ int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == nbts_call(_bt_keep_natts_fast, rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+#endif /* NBTS_SPECIALIZING_DEFAULT */
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (nbts_call_norel(_bt_check_rowcompare, rel, key, tuple,
+ tupnatts, tupdesc, dir, continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = nbts_call(_bt_keep_natts, rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
+ IndexTuple lastleft,
+ IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 93f8267b48..83e0dbab16 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1116,15 +1116,47 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+
+
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
+
+#define NBTS_ENABLED
+
+#ifdef NBTS_ENABLED
+
+/*
+ * Access a specialized nbtree function, based on the shape of the index key.
+ */
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) \
+( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+)
+
+#else /* not defined NBTS_ENABLED */
+
+#define NBT_SPECIALIZE_CALL(function, rel, ...) function(__VA_ARGS__)
+
+#endif /* NBTS_ENABLED */
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specialized.h"
+#include "nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1155,9 +1187,6 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
- IndexTuple newitem, Size newitemsz,
- bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1173,9 +1202,6 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
@@ -1223,12 +1249,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
- int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -1237,7 +1257,6 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1245,8 +1264,6 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1259,10 +1276,6 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
new file mode 100644
index 0000000000..23fdda4f0e
--- /dev/null
+++ b/src/include/access/nbtree_specialize.h
@@ -0,0 +1,204 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_specialize.h
+ * header file for the key-shape specialization machinery of the postgres
+ * btree access method implementation.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_specialize.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_call(function, indexrel, ...rest_of_args), and
+ * nbts_call_norel(function, indexrel, ...args)
+ * This will call the specialized variant of 'function' based on the index
+ * relation data.
+ * The difference between nbts_call and nbts_call_norel is that nbts_call
+ * passes indexrel as the first argument of the function call, whereas
+ * nbts_call_norel does not.
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ *
+ * example usage:
+ *
+ * kwithnulls = nbts_call_norel(_bt_key_hasnulls, myindex, mytuple, tupDesc);
+ *
+ * bool NBTS_FUNCTION(_bt_key_hasnulls)(IndexTuple mytuple, TupleDesc tupDesc)
+ * {
+ * nbts_attiterdeclare(mytuple);
+ * nbts_attiterinit(mytuple, 1, tupDesc);
+ * nbts_foreachattr(1, 10)
+ * {
+ * Datum it = nbts_attiter_nextattdatum(mytuple, tupDesc);
+ * if (nbts_attiter_curattisnull(mytuple))
+ * return true;
+ * }
+ * return false;
+ * }
+ */
+
+/*
+ * Call a potentially specialized function for a given btree operation.
+ *
+ * NB: the rel argument is evaluated multiple times.
+ */
+#define nbts_call(name, rel, ...) \
+ nbts_call_norel(name, (rel), (rel), __VA_ARGS__)
+
+#ifdef NBTS_ENABLED
+
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+#ifdef nbts_call_norel
+#undef nbts_call_norel
+#endif
+
+#define nbts_call_norel(name, rel, ...) \
+ (NBTS_FUNCTION(name)(__VA_ARGS__))
+
+/*
+ * Multiple key columns, with access optimized for attcacheoff-cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_CACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* reset call to SPECIALIZE_CALL for default behaviour */
+#undef nbts_call_norel
+#define nbts_call_norel(name, rel, ...) \
+ NBT_SPECIALIZE_CALL(name, (rel), __VA_ARGS__)
+
+/*
+ * "Default", externally accessible, not so much optimized functions
+ */
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+/* for the default functions, we want to use the unspecialized name. */
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) name
+
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/* from here on there are no more NBTS_FUNCTIONs */
+#undef NBTS_FUNCTION
+
+#else /* not defined NBTS_ENABLED */
+
+/*
+ * NBTS_ENABLED is not defined, so we don't want to use the specializations.
+ * We revert to the behaviour from PG14 and earlier, which only uses
+ * attcacheoff.
+ */
+
+#define NBTS_FUNCTION(name) name
+
+#define nbts_call_norel(name, rel, ...) \
+ name(__VA_ARGS__)
+
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+
+#endif /* !NBTS_ENABLED */
diff --git a/src/include/access/nbtree_specialized.h b/src/include/access/nbtree_specialized.h
new file mode 100644
index 0000000000..c45fa84aed
--- /dev/null
+++ b/src/include/access/nbtree_specialized.h
@@ -0,0 +1,67 @@
+/*
+ * prototypes for functions that are included in nbtree.h
+ */
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_specialize)(Relation rel);
+
+extern bool
+NBTS_FUNCTION(btinsert)(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void
+NBTS_FUNCTION(_bt_dedup_pass)(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool
+NBTS_FUNCTION(_bt_doinsert)(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack
+NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key,
+ Buffer *bufP, int access,
+ Snapshot snapshot);
+extern Buffer
+NBTS_FUNCTION(_bt_moveright)(Relation rel, BTScanInsert key, Buffer buf,
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot);
+extern OffsetNumber
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate);
+extern int32
+NBTS_FUNCTION(_bt_compare)(Relation rel, BTScanInsert key,
+ Page page, OffsetNumber offnum);
+
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert
+NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup);
+extern bool
+NBTS_FUNCTION(_bt_checkkeys)(Relation rel, IndexScanDesc scan,
+ IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple
+NBTS_FUNCTION(_bt_truncate)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int
+NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
--
2.30.2
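To make the macro machinery above easier to review: a call site written with
nbts_call() resolves, via nbts_call_norel() and NBT_SPECIALIZE_CALL(), to a
shape-suffixed function name. A minimal sketch of that expansion (illustration
only, not part of the patch; _bt_compare is just one of the specialized
functions, and parentheses added by the macros are omitted):

    /* as written in the specialized sources */
    result = nbts_call(_bt_compare, rel, key, page, offnum);

    /*
     * What the preprocessor produces with this patch, where
     * NBT_SPECIALIZE_CALL still unconditionally selects the
     * "cached" shape:
     */
    result = _bt_compare_cached(rel, key, page, offnum);

Later patches in the series replace NBT_SPECIALIZE_CALL with a dispatch on the
actual key shape of the index.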
Attachment: v6-0002-Use-specialized-attribute-iterators-in-backend-nb.patch
From 1ea75d187f602bed86a9bce88f2c50c1faeba49c Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 7 Apr 2022 12:30:00 +0200
Subject: [PATCH v6 2/8] Use specialized attribute iterators in
backend/*/nbt*_spec.h
Split out to make it clear which substantive changes were made to the
pre-existing functions.
Even though not all nbt*_spec functions have been updated, most call sites
can now call the specialized functions directly instead of having to determine
the right specialization from the (potentially locally unavailable) index
relation, which makes specializing those functions worth the effort.
---
src/backend/access/nbtree/nbtsearch_spec.h | 16 +++---
src/backend/access/nbtree/nbtsort_spec.h | 24 +++++----
src/backend/access/nbtree/nbtutils_spec.h | 63 +++++++++++++---------
3 files changed, 62 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
index 73d5370496..a5c5f2b94f 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.h
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -823,6 +823,7 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -854,23 +855,26 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
diff --git a/src/backend/access/nbtree/nbtsort_spec.h b/src/backend/access/nbtree/nbtsort_spec.h
index 8f4a3602ca..d3f2db2dc4 100644
--- a/src/backend/access/nbtree/nbtsort_spec.h
+++ b/src/backend/access/nbtree/nbtsort_spec.h
@@ -27,8 +27,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -50,7 +49,7 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -82,22 +81,25 @@ NBTS_FUNCTION(_bt_load)(BTWriteState *wstate, BTSpool *btspool,
}
else if (itup != NULL)
{
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
int32 compare = 0;
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.h b/src/backend/access/nbtree/nbtutils_spec.h
index a4b934ae7a..638eff18f6 100644
--- a/src/backend/access/nbtree/nbtutils_spec.h
+++ b/src/backend/access/nbtree/nbtutils_spec.h
@@ -211,6 +211,8 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
@@ -222,20 +224,22 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -243,6 +247,7 @@ NBTS_FUNCTION(_bt_keep_natts)(Relation rel, IndexTuple lastleft,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -295,7 +300,7 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
- int i;
+ nbts_attiterdeclare(itup);
itupdesc = RelationGetDescr(rel);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -326,7 +331,10 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -337,27 +345,30 @@ NBTS_FUNCTION(_bt_mkscankey)(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -744,24 +755,28 @@ NBTS_FUNCTION(_bt_keep_natts_fast)(Relation rel,
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft,itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) !=
+ nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
--
2.30.2
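The conversion applied throughout the nbt*_spec.h files follows one mechanical
pattern; roughly (a sketch, not a verbatim hunk, with declarations shown inline
for brevity):

    /* before: per-attribute decoding with index_getattr() */
    for (int attnum = 1; attnum <= keysz; attnum++)
    {
        bool    isNull;
        Datum   datum = index_getattr(itup, attnum, itupdesc, &isNull);

        /* ... use datum / isNull ... */
    }

    /* after: shape-specialized iteration */
    nbts_attiterdeclare(itup);

    nbts_attiterinit(itup, 1, itupdesc);
    nbts_foreachattr(1, keysz)
    {
        Datum   datum = nbts_attiter_nextattdatum(itup, itupdesc);

        /* nbts_attiter_attnum holds the current attribute number */
        if (!nbts_attiter_curattisnull(itup))
        {
            /* ... use datum ... */
        }
    }

For the "cached" shape these macros expand back to index_getattr(), so the
generated code is equivalent to the old loop; the gain comes from the shapes
where a smarter iterator can be substituted.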
Attachment: v6-0007-Add-specialization-to-btree-index-creation.patch
From 8d587233c593d510edc50783af303831837f6b32 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 21 Apr 2022 16:22:07 +0200
Subject: [PATCH v6 7/8] Add specialization to btree index creation.
This was an oversight that is easily corrected, but an oversight nonetheless.
This improves index (re)build performance by another few percent.
---
src/backend/utils/sort/multisort.c | 22 +++
src/backend/utils/sort/tuplesortvariants.c | 146 +----------------
.../utils/sort/tuplesortvariants_nbts.h | 150 ++++++++++++++++++
src/include/access/nbtree.h | 18 +++
4 files changed, 196 insertions(+), 140 deletions(-)
create mode 100644 src/backend/utils/sort/multisort.c
create mode 100644 src/backend/utils/sort/tuplesortvariants_nbts.h
diff --git a/src/backend/utils/sort/multisort.c b/src/backend/utils/sort/multisort.c
new file mode 100644
index 0000000000..3c0871faa6
--- /dev/null
+++ b/src/backend/utils/sort/multisort.c
@@ -0,0 +1,22 @@
+#include "postgres.h"
+
+#include "utils/sortsupport.h"
+#include "utils/tuplesort.h"
+
+struct MultiSortData {
+ Tuplesortstate *buffer;
+ void *inner;
+
+};
+
+
+/*
+ * MultiSort {
+ * TupleSort buffer Contains sorted run of tuples received from
+ * innerstate. May have dropped tuples since.
+ * void* innerstate Inner state of multisort scheme.
+ * SortSlot low_watermark (all tuples up to X have been received)
+ * SortSlot high_watermark (no tuples after X have been received)
+ * }
+ * sort_begin(MultiSort state,
+ */
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 2933020dcc..06fdeb64c6 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -57,8 +57,6 @@ static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
SortTuple *stup);
static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
LogicalTape *tape, unsigned int len);
-static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
@@ -130,6 +128,10 @@ typedef struct
int datumTypeLen;
} TuplesortDatumArg;
+#define NBT_SPECIALIZE_FILE "../../backend/utils/sort/tuplesortvariants_nbts.h"
+#include "access/nbtree_specialize.h"
+#undef NBT_SPECIALIZE_FILE
+
Tuplesortstate *
tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
@@ -350,7 +352,7 @@ tuplesort_begin_index_btree(Relation heapRel,
PARALLEL_SORT(coordinate));
base->removeabbrev = removeabbrev_index;
- base->comparetup = comparetup_index_btree;
+ base->comparetup = NBT_SPECIALIZE_NAME(comparetup_index_btree, indexRel);
base->writetup = writetup_index;
base->readtup = readtup_index;
base->haveDatum1 = true;
@@ -475,7 +477,7 @@ tuplesort_begin_index_gist(Relation heapRel,
base->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
base->removeabbrev = removeabbrev_index;
- base->comparetup = comparetup_index_btree;
+ base->comparetup = NBT_SPECIALIZE_NAME(comparetup_index_btree, indexRel);
base->writetup = writetup_index;
base->readtup = readtup_index;
base->haveDatum1 = true;
@@ -1245,142 +1247,6 @@ removeabbrev_index(Tuplesortstate *state, SortTuple *stups, int count)
}
}
-static int
-comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- /*
- * This is similar to comparetup_heap(), but expects index tuples. There
- * is also special handling for enforcing uniqueness, and special
- * treatment for equal keys at the end.
- */
- TuplesortPublic *base = TuplesortstateGetPublic(state);
- TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
- SortSupport sortKey = base->sortKeys;
- IndexTuple tuple1;
- IndexTuple tuple2;
- int keysz;
- TupleDesc tupDes;
- bool equal_hasnull = false;
- int nkey;
- int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
-
- /* Compare the leading sort key */
- compare = ApplySortComparator(a->datum1, a->isnull1,
- b->datum1, b->isnull1,
- sortKey);
- if (compare != 0)
- return compare;
-
- /* Compare additional sort keys */
- tuple1 = (IndexTuple) a->tuple;
- tuple2 = (IndexTuple) b->tuple;
- keysz = base->nKeys;
- tupDes = RelationGetDescr(arg->index.indexRel);
-
- if (sortKey->abbrev_converter)
- {
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
- }
-
- /* they are equal, so we only need to examine one null flag */
- if (a->isnull1)
- equal_hasnull = true;
-
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
- {
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
-
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare; /* done when we find unequal attributes */
-
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
- equal_hasnull = true;
- }
-
- /*
- * If btree has asked us to enforce uniqueness, complain if two equal
- * tuples are detected (unless there was at least one NULL field and NULLS
- * NOT DISTINCT was not set).
- *
- * It is sufficient to make the test here, because if two tuples are equal
- * they *must* get compared at some stage of the sort --- otherwise the
- * sort algorithm wouldn't have checked whether one must appear before the
- * other.
- */
- if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- /*
- * Some rather brain-dead implementations of qsort (such as the one in
- * QNX 4) will sometimes call the comparison routine to compare a
- * value to itself, but we always use our own implementation, which
- * does not.
- */
- Assert(tuple1 != tuple2);
-
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is required for
- * btree indexes, since heap TID is treated as an implicit last key
- * attribute in order to ensure that all keys in the index are physically
- * unique.
- */
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
- /* ItemPointer values should never be equal */
- Assert(false);
-
- return 0;
-}
-
static int
comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state)
diff --git a/src/backend/utils/sort/tuplesortvariants_nbts.h b/src/backend/utils/sort/tuplesortvariants_nbts.h
new file mode 100644
index 0000000000..d52d34c749
--- /dev/null
+++ b/src/backend/utils/sort/tuplesortvariants_nbts.h
@@ -0,0 +1,150 @@
+#ifndef NBTS_SPECIALIZING_DEFAULT
+
+static int NBTS_FUNCTION(comparetup_index_btree)(const SortTuple *a,
+ const SortTuple *b,
+ Tuplesortstate *state);
+
+static int
+NBTS_FUNCTION(comparetup_index_btree)(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ /*
+ * This is similar to comparetup_heap(), but expects index tuples. There
+ * is also special handling for enforcing uniqueness, and special
+ * treatment for equal keys at the end.
+ */
+ TuplesortPublic *base = TuplesortstateGetPublic(state);
+ TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
+ SortSupport sortKey = base->sortKeys;
+ IndexTuple tuple1;
+ IndexTuple tuple2;
+ int keysz;
+ TupleDesc tupDes;
+ bool equal_hasnull = false;
+ int nkey;
+ int32 compare;
+ nbts_attiterdeclare(tuple1);
+ nbts_attiterdeclare(tuple2);
+
+ /* Compare the leading sort key */
+ compare = ApplySortComparator(a->datum1, a->isnull1,
+ b->datum1, b->isnull1,
+ sortKey);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ tuple1 = (IndexTuple) a->tuple;
+ tuple2 = (IndexTuple) b->tuple;
+ keysz = base->nKeys;
+ tupDes = RelationGetDescr(arg->index.indexRel);
+
+ if (!sortKey->abbrev_converter)
+ {
+ nkey = 2;
+ sortKey++;
+ }
+ else
+ nkey = 1;
+
+ if (a->isnull1)
+ equal_hasnull = true;
+
+ nbts_attiterinit(tuple1, nkey, tupDes);
+ nbts_attiterinit(tuple2, nkey, tupDes);
+
+ nbts_foreachattr(nkey, keysz)
+ {
+ Datum datum1,
+ datum2;
+ datum1 = nbts_attiter_nextattdatum(tuple1, tupDes);
+ datum2 = nbts_attiter_nextattdatum(tuple2, tupDes);
+
+ if (nbts_attiter_attnum == 1)
+ {
+ compare = ApplySortAbbrevFullComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ }
+ else
+ {
+ compare = ApplySortComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ }
+
+ if (compare != 0)
+ return compare;
+
+ if (nbts_attiter_curattisnull(tuple1))
+ equal_hasnull = true;
+
+ sortKey++;
+ }
+
+ /*
+ * If btree has asked us to enforce uniqueness, complain if two equal
+ * tuples are detected (unless there was at least one NULL field and NULLS
+ * NOT DISTINCT was not set).
+ *
+ * It is sufficient to make the test here, because if two tuples are equal
+ * they *must* get compared at some stage of the sort --- otherwise the
+ * sort algorithm wouldn't have checked whether one must appear before the
+ * other.
+ */
+ if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ /*
+ * Some rather brain-dead implementations of qsort (such as the one in
+ * QNX 4) will sometimes call the comparison routine to compare a
+ * value to itself, but we always use our own implementation, which
+ * does not.
+ */
+ Assert(tuple1 != tuple2);
+
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is required for
+ * btree indexes, since heap TID is treated as an implicit last key
+ * attribute in order to ensure that all keys in the index are physically
+ * unique.
+ */
+ {
+ BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
+ BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
+
+ if (blk1 != blk2)
+ return (blk1 < blk2) ? -1 : 1;
+ }
+ {
+ OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
+ OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
+
+ if (pos1 != pos2)
+ return (pos1 < pos2) ? -1 : 1;
+ }
+
+ /* ItemPointer values should never be equal */
+ Assert(false);
+
+ return 0;
+}
+
+#endif
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 92894e4ea7..11116b47ca 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1170,6 +1170,24 @@ do { \
) \
)
+#define NBT_SPECIALIZE_NAME(name, rel) \
+( \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_COLUMN) \
+ ) \
+ : \
+ ( \
+ TupleDescAttr(RelationGetDescr(rel), \
+ IndexRelationGetNumberOfKeyAttributes(rel) - 1)->attcacheoff > 0 ? ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_UNCACHED) \
+ ) \
+ ) \
+)
+
#else /* not defined NBTS_ENABLED */
#define nbt_opt_specialize(rel)
--
2.30.2
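For reference, the NBT_SPECIALIZE_NAME() dispatch used in
tuplesort_begin_index_btree() boils down to the following runtime selection
(sketch only; the suffixed names are what NBTS_MAKE_NAME produces for the
NBTS_TYPE_* tokens):

    if (IndexRelationGetNumberOfKeyAttributes(indexRel) == 1)
        base->comparetup = comparetup_index_btree_single;
    else if (TupleDescAttr(RelationGetDescr(indexRel),
                           IndexRelationGetNumberOfKeyAttributes(indexRel) - 1)->attcacheoff > 0)
        base->comparetup = comparetup_index_btree_cached;
    else
        base->comparetup = comparetup_index_btree_uncached;

The attcacheoff > 0 test on the last key attribute is what selects between the
cached and uncached shapes.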
Attachment: v6-0006-Implement-specialized-uncacheable-attribute-itera.patch
From 81fd521dc4e67896456703997caddaa00721686f Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 8 Apr 2022 14:44:01 +0200
Subject: [PATCH v6 6/8] Implement specialized uncacheable attribute iteration
Uses an iterator to prevent doing duplicate work while iterating over
attributes.
Inspiration: https://www.postgresql.org/message-id/CAEze2WjE9ka8i%3Ds-Vv5oShro9xTrt5VQnQvFG9AaRwWpMm3-fg%40mail.gmail.com
---
src/backend/access/nbtree/nbtree_spec.h | 1 +
src/include/access/itup_attiter.h | 198 ++++++++++++++++++++++++
src/include/access/nbtree.h | 13 +-
src/include/access/nbtree_specialize.h | 40 ++++-
4 files changed, 249 insertions(+), 3 deletions(-)
create mode 100644 src/include/access/itup_attiter.h
diff --git a/src/backend/access/nbtree/nbtree_spec.h b/src/backend/access/nbtree/nbtree_spec.h
index 4c342287f6..88b01c86f7 100644
--- a/src/backend/access/nbtree/nbtree_spec.h
+++ b/src/backend/access/nbtree/nbtree_spec.h
@@ -9,6 +9,7 @@ void
NBTS_FUNCTION(_bt_specialize)(Relation rel)
{
#ifdef NBTS_SPECIALIZING_DEFAULT
+ PopulateTupleDescCacheOffsets(rel->rd_att);
nbts_call_norel(_bt_specialize, rel, rel);
#else
rel->rd_indam->aminsert = NBTS_FUNCTION(btinsert);
diff --git a/src/include/access/itup_attiter.h b/src/include/access/itup_attiter.h
new file mode 100644
index 0000000000..9f16a4b3d7
--- /dev/null
+++ b/src/include/access/itup_attiter.h
@@ -0,0 +1,198 @@
+/*-------------------------------------------------------------------------
+ *
+ * itup_attiter.h
+ * POSTGRES index tuple attribute iterator definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/itup_attiter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ITUP_ATTITER_H
+#define ITUP_ATTITER_H
+
+#include "access/itup.h"
+
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupdesc - the tuple description
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* Offset of attribute 1 is always 0 */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ false, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false);
+
+/*
+ * Initialize an index attribute iterator so that the first
+ * index_attiternext() call returns the datum of attribute attnum.
+ *
+ * This is nearly the same as index_deform_tuple, except that it only
+ * computes the iterator state up to attnum, instead of populating the
+ * datum and isnull arrays.
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+/* ----------------
+ * index_attiternext() - get the next attribute of an index tuple
+ *
+ * This gets called many times, so we do the least amount of work
+ * possible.
+ *
+ * The code does not attempt to update attcacheoff, as it is unlikely
+ * to reach a situation where the cached offset matters a lot.
+ * If the cached offsets do matter, the caller should make sure that
+ * PopulateTupleDescCacheOffsets() was called on the tuple descriptor
+ * to populate the attribute offset cache.
+ *
+ * ----------------
+ */
+static inline Datum
+index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true;
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
+#endif /* ITUP_ATTITER_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 1559399b0e..92894e4ea7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -16,6 +16,7 @@
#include "access/amapi.h"
#include "access/itup.h"
+#include "access/itup_attiter.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/xlogreader.h"
@@ -1122,6 +1123,7 @@ typedef struct BTOptions
*/
#define NBTS_TYPE_SINGLE_COLUMN single
#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_DEFAULT default
@@ -1152,12 +1154,19 @@ do { \
#define NBT_SPECIALIZE_CALL(function, rel, ...) \
( \
- IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
+ IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? ( \
NBTS_MAKE_NAME(function, NBTS_TYPE_SINGLE_COLUMN)(__VA_ARGS__) \
) \
: \
( \
- NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ TupleDescAttr(RelationGetDescr(rel), \
+ IndexRelationGetNumberOfKeyAttributes(rel) - 1)->attcacheoff > 0 ? ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_CACHED)(__VA_ARGS__) \
+ ) \
+ : \
+ ( \
+ NBTS_MAKE_NAME(function, NBTS_TYPE_UNCACHED)(__VA_ARGS__) \
+ ) \
) \
)
diff --git a/src/include/access/nbtree_specialize.h b/src/include/access/nbtree_specialize.h
index 9733a27bdd..efbacf7d67 100644
--- a/src/include/access/nbtree_specialize.h
+++ b/src/include/access/nbtree_specialize.h
@@ -115,7 +115,11 @@
#define nbts_attiter_nextattdatum(itup, tupDesc) \
( \
AssertMacro(spec_i == 0), \
- (IndexTupleHasNulls(itup) && att_isnull(0, (char *)(itup) + sizeof(IndexTupleData))) ? \
+ ( \
+ IndexTupleHasNulls(itup) && \
+ att_isnull(0, (bits8 *) ((char *) (itup) + sizeof(IndexTupleData))) \
+ ) \
+ ? \
( \
(NBTS_MAKE_NAME(itup, isNull)) = true, \
(Datum)NULL \
@@ -175,6 +179,40 @@
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but the attcacheoff optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit((itup), (initAttNum), (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/* reset call to SPECIALIZE_CALL for default behaviour */
#undef nbts_call_norel
#define nbts_call_norel(name, rel, ...) \
--
2.30.2
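As a quick orientation for the 0004 patch above: the new iterator is normally driven through the nbts_* macros, but a hand-written caller would look roughly like the sketch below. This is an illustration only, not code from the patchset; example_scan_key_columns and its arguments are invented, while IAttrIterStateData, index_attiterinit() and index_attiternext() are the definitions from itup_attiter.h as added above.

#include "access/itup_attiter.h"	/* header added by the 0004 patch */

/*
 * Illustration only: walk the first nkeyatts attributes of an index tuple
 * with the iterator state, instead of calling index_getattr() once per
 * attribute (which, without attcacheoff, re-walks the tuple every time).
 */
static void
example_scan_key_columns(IndexTuple itup, TupleDesc itupdesc, int nkeyatts)
{
	IAttrIterStateData iter;

	/* position the iterator so that the next call returns attribute 1 */
	index_attiterinit(itup, 1, itupdesc, &iter);

	for (AttrNumber attnum = 1; attnum <= nkeyatts; attnum++)
	{
		Datum		datum = index_attiternext(itup, attnum, itupdesc, &iter);

		if (iter.isNull)
			continue;			/* NULL attribute; datum is 0 */

		/* ... compare or otherwise consume datum here ... */
		(void) datum;
	}
}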
v6-0008-Implement-dynamic-prefix-compression-in-nbtree.patchapplication/octet-stream; name=v6-0008-Implement-dynamic-prefix-compression-in-nbtree.patchDownload
From 512c885a103a807eadf5c8611b66eeb6ca74b11d Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 6 Jun 2022 23:16:18 +0200
Subject: [PATCH v6 8/8] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, if some prefix of the
scan attributes on both sides of the compared tuple are equal
to the scankey, then the current tuple that is being compared
must also have those prefixing attributes that equal the
scankey.
We cannot propagate this information to _binsrch on lower pages,
as this downstream page may concurrently have split and/or have
merged with its deleted left neighbour (see [0]), which moves
the keyspace of the linked page. We thus can only trust the
current state of this current page for this optimization, which
means we must validate this state each time we open the page.
Although this limits the overall applicability of the
performance improvement, it still allows for a nice performance
improvement in most cases where initial columns have many
duplicate values and a compare function that is not cheap.
Additionally, most of the time a page's highkey is equal to the
right separator on the parent page. By storing this separator
and doing a binary equality check, we can cheaply validate the
highkey on a page, which also allows us to carry over the right
separator's prefix into the page.
---
contrib/amcheck/verify_nbtree.c | 17 +--
src/backend/access/nbtree/README | 25 +++++
src/backend/access/nbtree/nbtinsert.c | 14 ++-
src/backend/access/nbtree/nbtinsert_spec.h | 22 ++--
src/backend/access/nbtree/nbtsearch.c | 3 +-
src/backend/access/nbtree/nbtsearch_spec.h | 115 ++++++++++++++++++---
src/include/access/nbtree_specialized.h | 9 +-
7 files changed, 169 insertions(+), 36 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 2beeebb163..8c4215372a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2700,6 +2700,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2709,13 +2710,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2777,6 +2778,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2787,7 +2789,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2839,10 +2841,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2862,10 +2865,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2900,13 +2904,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3c08888c23..13ac9ee2be 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,31 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because NBTrees have a sorted keyspace, when we have determined that some
+prefixing columns of tuples on both sides of the tuple that is being
+compared are equal to the scankey, then the current tuple must also share
+this prefix with the scankey. This allows us to skip comparing those columns,
+potentially saving cycles.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so this is only useful on the page level: Concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by an inner page to the right. If we re-used high- and low-column-prefixes,
+we would not be able to detect a change of keyspace from e.g. (2,2) to (1,2),
+and subsequently return invalid results. This race condition can only be
+prevented by re-establishing the prefix-equal-columns for each page.
+
+The positive part of this is that we already have a result for the highest
+value of a page: a page's highkey is compared to the scankey while we have
+a pin on the page in the _bt_moveright procedure. The _bt_binsrch procedure
+will use this result as a rightmost prefix compare, and for each step in the
+binary search (that does not compare less than the insert key) improve the
+equal-prefix bounds.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ec6c73d1cc..20e5f33f98 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -132,7 +132,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ offset = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -142,6 +142,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -177,7 +179,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(nbts_call(_bt_compare, rel, itup_key, page, offset) < 0);
+ Assert(nbts_call(_bt_compare, rel, itup_key, page, offset,
+ &cmpcol) < 0);
break;
}
@@ -202,7 +205,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (nbts_call(_bt_compare, rel, itup_key, page, offset) != 0)
+ if (nbts_call(_bt_compare, rel, itup_key, page, offset,
+ &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -412,11 +416,13 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY);
+ highkeycmp = nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY,
+ &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
diff --git a/src/backend/access/nbtree/nbtinsert_spec.h b/src/backend/access/nbtree/nbtinsert_spec.h
index 97c866aea3..ccba0fa5ed 100644
--- a/src/backend/access/nbtree/nbtinsert_spec.h
+++ b/src/backend/access/nbtree/nbtinsert_spec.h
@@ -73,6 +73,7 @@ NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber comparecol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -91,7 +92,8 @@ NBTS_FUNCTION(_bt_search_insert)(Relation rel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- nbts_call(_bt_compare, rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ nbts_call(_bt_compare, rel, insertstate->itup_key, page,
+ P_HIKEY, &comparecol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -221,6 +223,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
/*
* Does the new tuple belong on this page?
*
@@ -238,7 +241,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0)
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, insertstate, stack);
@@ -291,6 +294,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -321,7 +325,8 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) != 0 ||
+ nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY,
+ &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -336,10 +341,13 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- nbts_call(_bt_compare, rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) || nbts_call(_bt_compare, rel, itup_key,
+ page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -358,7 +366,7 @@ NBTS_FUNCTION(_bt_findinsertloc)(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate);
+ newitemoff = nbts_call(_bt_binsrch_insert, rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d5152bfcb7..607940bbcd 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -178,6 +178,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ AttrNumber attno = 1;
Assert(!BTScanPosIsValid(so->currPos));
@@ -696,7 +697,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = nbts_call(_bt_binsrch, rel, &inskey, buf);
+ offnum = nbts_call(_bt_binsrch, rel, &inskey, buf, &attno);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsearch_spec.h b/src/backend/access/nbtree/nbtsearch_spec.h
index a5c5f2b94f..19a6178334 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.h
+++ b/src/backend/access/nbtree/nbtsearch_spec.h
@@ -10,8 +10,10 @@
*/
#ifndef NBTS_SPECIALIZING_DEFAULT
-static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel, BTScanInsert key,
- Buffer buf);
+static OffsetNumber NBTS_FUNCTION(_bt_binsrch)(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ AttrNumber *highkeycmpcol);
static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
@@ -38,7 +40,8 @@ static bool NBTS_FUNCTION(_bt_readpage)(IndexScanDesc scan, ScanDirection dir,
static OffsetNumber
NBTS_FUNCTION(_bt_binsrch)(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -46,6 +49,8 @@ NBTS_FUNCTION(_bt_binsrch)(Relation rel,
high;
int32 result,
cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -87,17 +92,26 @@ NBTS_FUNCTION(_bt_binsrch)(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = nbts_call(_bt_compare, rel, key, page, mid);
+ result = nbts_call(_bt_compare, rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
+ *highkeycmpcol = highcmpcol;
+
/*
* At this point we have high == low, but be careful: they could point
* past the last slot on the page.
@@ -423,6 +437,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
/* Get the root page to start with */
*bufP = _bt_getroot(rel, access);
@@ -441,6 +456,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
IndexTuple itup;
BlockNumber child;
BTStack new_stack;
+ AttrNumber highkeycmpcol = 1;
/*
* Race -- the page we just grabbed may have split since we read its
@@ -456,7 +472,8 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
*/
*bufP = nbts_call(_bt_moveright, rel, key, *bufP,
(access == BT_WRITE), stack_in,
- page_access, snapshot);
+ page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -468,12 +485,17 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = nbts_call(_bt_binsrch, rel, key, *bufP);
+ offnum = nbts_call(_bt_binsrch, rel, key, *bufP, &highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
child = BTreeTupleGetDownLink(itup);
+ if (highkeycmpcol > 1)
+ {
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+ }
+
/*
* We need to save the location of the pivot tuple we chose in a new
* stack entry for this page/level. If caller ends up splitting a
@@ -507,6 +529,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ AttrNumber highkeycmpcol = 1;
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -517,7 +540,7 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key, Buffer *bufP,
* move right to its new sibling. Do that.
*/
*bufP = nbts_call(_bt_moveright, rel, key, *bufP, true, stack_in,
- BT_WRITE, snapshot);
+ BT_WRITE, snapshot, &highkeycmpcol, (char *) tupdatabuf);
}
return stack_in;
@@ -565,12 +588,16 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
bool forupdate,
BTStack stack,
int access,
- Snapshot snapshot)
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
+ Assert(PointerIsValid(comparecol));
+
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -592,12 +619,17 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = cmpcol;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -623,14 +655,49 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY) >= cmpval)
+ /*
+ * When comparecol is > 1, tupdatabuf is filled with the right separator
+ * of the parent node. This allows us to do a binary equality check
+ * between the parent node's right separator (which is < key) and this
+ * page's P_HIKEY. If they are equal, we can reuse the result of the
+ * parent node's rightkey compare, which means we can potentially save
+ * a full key compare.
+ *
+ * Without this, we'd on average use 3 full key compares per page before
+ * we achieve full dynamic prefix bounds; with this optimization we can
+ * skip one of those.
+ *
+ * 3: 1 for the highkey (rightmost), and on average 2 before we move
+ * right in the binary search on the page.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ }
+ }
+
+ if (P_IGNORE(opaque) || nbts_call(_bt_compare, rel, key, page, P_HIKEY,
+ &cmpcol) >= cmpval)
{
/* step right one page */
+ *comparecol = 1;
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -663,7 +730,8 @@ NBTS_FUNCTION(_bt_moveright)(Relation rel,
* list split).
*/
OffsetNumber
-NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -673,6 +741,7 @@ NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -723,16 +792,21 @@ NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = nbts_call(_bt_compare, rel, key, page, mid);
+ result = nbts_call(_bt_compare, rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
if (result != 0)
stricthigh = high;
}
@@ -813,7 +887,8 @@ int32
NBTS_FUNCTION(_bt_compare)(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -854,10 +929,11 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- nbts_attiterinit(itup, 1, itupdesc);
- nbts_foreachattr(1, ncmpkey)
+ nbts_attiterinit(itup, *comparecol, itupdesc);
+ scankey = key->scankeys + ((*comparecol) - 1);
+
+ nbts_foreachattr(*comparecol, ncmpkey)
{
Datum datum;
@@ -902,11 +978,20 @@ NBTS_FUNCTION(_bt_compare)(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = nbts_attiter_attnum;
return result;
+ }
scankey++;
}
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
diff --git a/src/include/access/nbtree_specialized.h b/src/include/access/nbtree_specialized.h
index c45fa84aed..ddceb4a4aa 100644
--- a/src/include/access/nbtree_specialized.h
+++ b/src/include/access/nbtree_specialized.h
@@ -43,12 +43,15 @@ NBTS_FUNCTION(_bt_search)(Relation rel, BTScanInsert key,
extern Buffer
NBTS_FUNCTION(_bt_moveright)(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access,
- Snapshot snapshot);
+ Snapshot snapshot, AttrNumber *comparecol,
+ char *tupdatabuf);
extern OffsetNumber
-NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate);
+NBTS_FUNCTION(_bt_binsrch_insert)(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
extern int32
NBTS_FUNCTION(_bt_compare)(Relation rel, BTScanInsert key,
- Page page, OffsetNumber offnum);
+ Page page, OffsetNumber offnum,
+ AttrNumber *comparecol);
/*
* prototypes for functions in nbtutils_spec.h
--
2.30.2
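As a side note for readers of the binary-search changes above: the dynamic prefix invariant itself is independent of nbtree. The standalone sketch below is not taken from the patch (row_cmp_fn and prefix_binsrch are invented names over abstract sorted rows); it only shows the same bookkeeping: once the lower and upper fence of the search are each known to agree with the key on some leading columns, every row between them agrees on at least the smaller of those two prefixes, so the next comparison may start at that column.

#include <stddef.h>

/*
 * Compare the key against row 'mid', starting at column *startcol (all
 * earlier columns are already known to be equal).  On return, *startcol
 * holds the index of the first differing column, or the total number of
 * key columns if all compared columns were equal.  Returns <0, 0 or >0
 * as the key sorts before, equal to, or after the row.
 */
typedef int (*row_cmp_fn) (const void *key, size_t mid, int *startcol);

/* Return the index of the first row that sorts >= key. */
static size_t
prefix_binsrch(const void *key, size_t nrows, row_cmp_fn cmp)
{
	size_t		low = 0,
				high = nrows;
	int			lowcol = 0;		/* prefix proven equal at row[low - 1] */
	int			highcol = 0;	/* prefix proven equal at row[high] */

	while (low < high)
	{
		size_t		mid = low + (high - low) / 2;
		/* safe starting column: the smaller of the two proven prefixes */
		int			col = (lowcol < highcol) ? lowcol : highcol;
		int			res = cmp(key, mid, &col);

		if (res > 0)
		{
			low = mid + 1;
			lowcol = col;		/* row[low - 1] now agrees on 'col' columns */
		}
		else
		{
			high = mid;
			highcol = col;		/* row[high] now agrees on 'col' columns */
		}
	}
	return low;
}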
On Wed, 27 Jul 2022 at 13:34, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Wed, 27 Jul 2022 at 09:35, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > On Mon, 4 Jul 2022 at 16:18, Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > On Sun, 5 Jun 2022 at 21:12, Matthias van de Meent
> > > <boekewurm+postgres@gmail.com> wrote:
> > > > While working on benchmarking the v2 patchset, I noticed no
> > > > improvement on reindex, which I attributed to forgetting to also
> > > > specialize comparetup_index_btree in tuplesort.c. After adding the
> > > > specialization there as well (attached in v3), reindex performance
> > > > improved significantly too.
> > >
> > > PFA version 4 of this patchset. Changes:
> >
> > Version 5 now, which is identical to v4 except for bitrot fixes to
> > deal with f58d7073.
>
> ... and now v6 to deal with d0b193c0 and co.
>
> I probably should've waited a bit longer this morning and checked
> master before sending, but that's not how it went. Sorry for the
> noise.

Here's the dynamic prefix truncation patch on its own (this was 0008).
I'll test the performance of this tomorrow, but at least it compiles
and passes check-world against HEAD @ 6e10631d. If performance doesn't
disappoint (isn't measurably worse in known workloads), this will be
the only patch in the patchset - specialization would then be dropped.
Else, tomorrow I'll post the remainder of the patchset that
specializes the nbtree functions on key shape.
Kind regards,
Matthias van de Meent.
Attachments:
v7-0001-Implement-dynamic-prefix-compression-in-nbtree.patchapplication/octet-stream; name=v7-0001-Implement-dynamic-prefix-compression-in-nbtree.patchDownload
From a3aad32e679d75ec98f05ea7fa24ebc6109d434e Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 16 Sep 2022 17:38:32 +0200
Subject: [PATCH v7] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, if some prefix of the
scan attributes on both sides of the compared tuple are equal
to the scankey, then the current tuple that is being compared
must also have those prefixing attributes that equal the
scankey.
We cannot generally propagate this information to _binsrch on
lower pages, as this downstream page may have concurrently split
and/or have merged with its deleted left neighbour (see [0]),
which moves the keyspace of the linked page. We thus can only
trust the current state of this current page for this optimization,
which means we must validate this state each time we open the page.
Although this limits the overall applicability of the
performance improvement, it still allows for a nice performance
improvement in most cases where initial columns have many
duplicate values and a compare function that is not cheap.
As an exception to the above rule, most of the time a page's
highkey is equal to the right separator on the parent page due to
how btree splits are done. By storing this right separator from
the parent page and then validating that the highkey of the child
page contains the exact same data, we can restore the right prefix
bound without having to call the relatively expensive _bt_compare.
In the worst-case scenario of a concurrent page split, we'd still
have to validate the full key, but that doesn't happen very often,
so in the common case we only pay one branch and a memcmp() to
validate whether the child's highkey still matches the stored
separator.
---
contrib/amcheck/verify_nbtree.c | 17 ++--
src/backend/access/nbtree/README | 42 +++++++++
src/backend/access/nbtree/nbtinsert.c | 34 +++++---
src/backend/access/nbtree/nbtsearch.c | 119 +++++++++++++++++++++++---
src/include/access/nbtree.h | 10 ++-
5 files changed, 188 insertions(+), 34 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9021d156eb..041ff5464e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2701,6 +2701,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2710,13 +2711,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2778,6 +2779,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2788,7 +2790,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2840,10 +2842,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2863,10 +2866,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2901,13 +2905,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 5529afc1fe..5df29a692e 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,48 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because NBTrees have a sorted keyspace, when we have determined that some
+prefixing columns of tuples on both sides of the tuple that is being
+compared are equal to the scankey, then the current tuple must also share
+this prefix with the scankey. This allows us to skip comparing those columns,
+saving the indirect function calls in the compare operation.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so this is only useful on the page level: Concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by a parent page to the right. If we re-used high- and low-column-prefixes,
+we would not be able to detect a change of keyspace from e.g. [2,3) to [1,2),
+and subsequently return invalid results. This race condition can only be
+prevented by re-establishing the prefix-equal-columns for each page.
+
+There is positive news, though: a page split will put a binary copy of the
+page's highkey in the parent page. This means that we can usually reuse
+the compare result of the separator to the right of the parent page's
+downlink when its representation is binary equal to the child's highkey.
+In general this will be the case, as only with concurrent page splits and
+deletes may the downlink not point to the page with the correct highkey
+bound (_bt_moveright only rarely actually moves right).
+
+To implement this, we copy the downlink's right separator key into a
+temporary buffer, which is then compared against the child page's highkey.
+If they match, we reuse the compare result (plus prefix) we had for it from
+the parent page; if not, we need to do a full _bt_compare. Because memcpy +
+memcmp is cheap compared to _bt_compare, and because it's quite unlikely
+that we guess wrong, this speeds up _bt_moveright (at the cost of some stack
+memory in _bt_search and some overhead in case of a wrong prediction).
+
+Now that we have prefix bounds on the highest value of a page, the
+_bt_binsrch procedure will use this result as a rightmost prefix compare,
+and for each step in the binary search (that does not compare less than the
+insert key) improve the equal-prefix bounds.
+
+Using the above optimization, we now (on average) only need 2 full key
+compares per page, as opposed to ceil(log2(ntupsperpage)) + 1; a significant
+improvement.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f6f4af8bfe..36e2d8ffed 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -326,6 +326,7 @@ _bt_search_insert(Relation rel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -344,7 +345,8 @@ _bt_search_insert(Relation rel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -438,7 +440,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = _bt_binsrch_insert(rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -448,6 +450,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -483,7 +487,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(_bt_compare(rel, itup_key, page, offset, &cmpcol) < 0);
break;
}
@@ -508,7 +512,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (_bt_compare(rel, itup_key, page, offset, &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -718,11 +722,12 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -865,6 +870,8 @@ _bt_findinsertloc(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Does the new tuple belong on this page?
*
@@ -882,7 +889,7 @@ _bt_findinsertloc(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, insertstate, stack);
@@ -935,6 +942,8 @@ _bt_findinsertloc(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
+
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -965,7 +974,7 @@ _bt_findinsertloc(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -980,10 +989,13 @@ _bt_findinsertloc(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -1002,7 +1014,7 @@ _bt_findinsertloc(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c74543bfde..67c8705c54 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +99,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
/* Get the root page to start with */
*bufP = _bt_getroot(rel, access);
@@ -130,7 +133,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* opportunity to finish splits of internal pages too.
*/
*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot);
+ page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -142,12 +146,15 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = _bt_binsrch(rel, key, *bufP);
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
child = BTreeTupleGetDownLink(itup);
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
/*
* We need to save the location of the pivot tuple we chose in a new
* stack entry for this page/level. If caller ends up splitting a
@@ -181,6 +188,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ highkeycmpcol = 1;
+
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -191,7 +200,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* move right to its new sibling. Do that.
*/
*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
+ snapshot, &highkeycmpcol, (char *) tupdatabuf);
}
return stack_in;
@@ -239,12 +248,16 @@ _bt_moveright(Relation rel,
bool forupdate,
BTStack stack,
int access,
- Snapshot snapshot)
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
+
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -266,12 +279,17 @@ _bt_moveright(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -297,14 +315,55 @@ _bt_moveright(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+ /*
+ * tupdatabuf is filled with the right separator of the parent node.
+ * This allows us to do a binary equality check between the parent
+ * node's right separator (which is < key) and this page's P_HIKEY.
+ * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+ * Without this, we'd on average use 3 full key compares per page before
+ * we achieve full dynamic prefix bounds, but with this optimization
+ * that is only 2.
+ *
+ * 3 compares: 1 for the highkey (rightmost), and on average 2 before
+ * we move right in the binary search on the page; this average equals
+ * SUM (1/2^x) for x from 0 to log2(n items), which tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ } else {
+ *comparecol = 1;
+ }
+ } else {
+ *comparecol = 1;
+ }
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
{
+ *comparecol = 1;
/* step right one page */
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -337,7 +396,8 @@ _bt_moveright(Relation rel,
static OffsetNumber
_bt_binsrch(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -345,6 +405,8 @@ _bt_binsrch(Relation rel,
high;
int32 result,
cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -386,16 +448,25 @@ _bt_binsrch(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
+
+ *highkeycmpcol = highcmpcol;
/*
* At this point we have high == low, but be careful: they could point
@@ -439,7 +510,8 @@ _bt_binsrch(Relation rel,
* list split).
*/
OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -449,6 +521,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -499,16 +572,22 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
+
if (result != 0)
stricthigh = high;
}
@@ -656,7 +735,8 @@ int32
_bt_compare(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -696,8 +776,9 @@ _bt_compare(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
{
Datum datum;
bool isNull;
@@ -741,11 +822,20 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = i;
return result;
+ }
scankey++;
}
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
@@ -876,6 +966,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ AttrNumber cmpcol = 1;
Assert(!BTScanPosIsValid(so->currPos));
@@ -1392,7 +1483,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = _bt_binsrch(rel, &inskey, buf, &cmpcol);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8e4f6864e5..ddc34a7a9e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1225,9 +1225,13 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
int access, Snapshot snapshot);
extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot, AttrNumber *comparecol,
+ char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
--
2.30.2
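One note on the arithmetic in the _bt_moveright comment above, under the simplifying assumption that each probe of the in-page binary search is equally likely to move the lower or the upper bound: if T is the number of full key compares made before the search first moves right (which is when the low-side prefix bound gets established), then

    E[T] = \sum_{x \ge 0} \Pr[T > x] = \sum_{x=0}^{\lceil \log_2 n \rceil} 2^{-x} \le \sum_{x \ge 0} 2^{-x} = 2 ,

so roughly two full compares are spent inside the binary search, plus one compare against the high key in _bt_moveright; reusing the parent's separator result for the high key removes that extra compare, which is where the "3 compares versus 2" figure comes from.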
Hi Matthias,
I'm going to look at this patch series if you're still interested. What was the status of your final performance testing for the 0008 patch alone vs the specialization series? Last I saw on the thread you were going to see if the specialization was required or not.
Best,
David
On Thu, 12 Jan 2023 at 16:11, David Christensen <david@pgguru.net> wrote:
> Hi Matthias,
>
> I'm going to look at this patch series if you're still interested. What was the status of your final performance testing for the 0008 patch alone vs the specialization series? Last I saw on the thread you were going to see if the specialization was required or not.
Thank you for your interest, and sorry for the delayed response. I've
been working on rebasing and polishing the patches, and hit some
issues benchmarking the set. Attached in Perf_results.xlsx are the
results of my benchmarks, and a new rebased patchset.
TLDR for benchmarks: There may be a small regression of 0.5-1% in the
patchset for reindex and insert-based workloads in certain corner
cases, but I can't rule out that it's just a quirk of my testing
setup. Master was taken at eb5ad4ff, and all patches are based on that
as well.
0001 (former 0008) sees 2% performance loss on average on
non-optimized index insertions - this performance loss is fixed with
the rest of the patchset.
The patchset was reordered again: 0001 contains the dynamic prefix
truncation changes; 0002 and 0003 refactor and update btree code to
specialize on key shape, and 0004 and 0006 define the specializations.
0005 is a separate addition to the attcacheoff infrastructure that is
useful on its own; it flags an attribute with 'this attribute cannot
have a cached offset, look at this other attribute instead'.
A significant change from previous versions is that the specialized
function identifiers are published as macros, so `_bt_compare` is
published as a macro that (based on context) calls the specialized
version. This reduced a lot of cognitive overhead and churn in the
code.
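To illustrate that last point: such a dispatching macro could look roughly like the sketch below, modelled on the NBT_SPECIALIZE_CALL macro from the v1 patches earlier in this thread. The _bt_compare_single/_bt_compare_cached/_bt_compare_uncached names are placeholders for whatever identifiers the name-mangling macros actually generate in v8+, so treat this as the shape of the idea, not the literal code.

/*
 * Illustrative sketch only: publish _bt_compare as a macro that picks a
 * key-shape-specialized implementation based on the index relation.
 */
#define _bt_compare(rel, key, page, offnum, comparecol) \
	(IndexRelationGetNumberOfKeyAttributes(rel) == 1 ? \
		_bt_compare_single((rel), (key), (page), (offnum), (comparecol)) : \
	 TupleDescAttr(RelationGetDescr(rel), \
				   IndexRelationGetNumberOfKeyAttributes(rel) - 1)->attcacheoff > 0 ? \
		_bt_compare_cached((rel), (key), (page), (offnum), (comparecol)) : \
		_bt_compare_uncached((rel), (key), (page), (offnum), (comparecol)))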
Kind regards,
Matthias van de Meent
Attachments:
v8-0001-Implement-dynamic-prefix-compression-in-nbtree.patchapplication/octet-stream; name=v8-0001-Implement-dynamic-prefix-compression-in-nbtree.patchDownload
From 02ff7258fa073e9a21b032ea40da34620a265182 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 10 Jan 2023 21:45:44 +0100
Subject: [PATCH v8 1/6] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, if some prefix of the
scan attributes on both sides of the compared tuple are equal
to the scankey, then the current tuple that is being compared
must also have those prefixing attributes that equal the
scankey.
We cannot generally propagate this information to _binsrch on
lower pages, as this downstream page may have concurrently split
and/or have merged with its deleted left neighbour (see [0]),
which moves the keyspace of the linked page. We thus can only
trust the current state of this current page for this optimization,
which means we must validate this state each time we open the page.
Although this limits the overall applicability of the
performance improvement, it still allows for a nice performance
improvement in most cases where initial columns have many
duplicate values and a compare function that is not cheap.
As an exception to the above rule, most of the time a page's
highkey is equal to the right separator on the parent page due to
how btree splits are done. By storing this right separator from
the parent page and then validating that the highkey of the child
page contains the exact same data, we can restore the right prefix
bound without having to call the relatively expensive _bt_compare.
In the worst-case scenario of a concurrent page split, we'd still
have to validate the full key, but that doesn't happen very often
when compared to the number of times we descend the btree.
---
contrib/amcheck/verify_nbtree.c | 17 ++--
reproducer.bench.sql | 8 ++
src/backend/access/nbtree/README | 42 +++++++++
src/backend/access/nbtree/nbtinsert.c | 34 +++++---
src/backend/access/nbtree/nbtsearch.c | 119 +++++++++++++++++++++++---
src/include/access/nbtree.h | 10 ++-
6 files changed, 196 insertions(+), 34 deletions(-)
create mode 100644 reproducer.bench.sql
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 257cff671b..22bb229820 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2701,6 +2701,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2710,13 +2711,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2778,6 +2779,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2788,7 +2790,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2840,10 +2842,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2863,10 +2866,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2901,13 +2905,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/reproducer.bench.sql b/reproducer.bench.sql
new file mode 100644
index 0000000000..76cd8d481e
--- /dev/null
+++ b/reproducer.bench.sql
@@ -0,0 +1,8 @@
+BEGIN;
+SELECT omg.*
+FROM something_is_wrong_here AS omg
+ORDER BY random()
+LIMIT 1
+\gset
+UPDATE something_is_wrong_here SET value = :value + 1 WHERE id = :id;
+END;
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index dd0f7ad2bd..4d7fa5aff4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,48 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because NBTrees have a sorted keyspace, when we have determined that some
+prefixing columns of tuples on both sides of the tuple that is being
+compared are equal to the scankey, then the current tuple must also share
+this prefix with the scankey. This allows us to skip comparing those columns,
+saving the indirect function calls in the compare operation.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so this is only useful on the page level: Concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by a parent page to the right. If we re-used high- and low-column-prefixes,
+we would not be able to detect a change of keyspace from e.g. [2,3) to [1,2),
+and subsequently return invalid results. This race condition can only be
+prevented by re-establishing the prefix-equal-columns for each page.
+
+There is positive news, though: a page split will put a binary copy of the
+page's highkey in the parent page. This means that we can usually reuse
+the compare result of the downlink's right separator on the parent page
+when we discover that its representation is binary-equal to the highkey.
+In general this will be the case, as only during concurrent page splits
+and deletes may the downlink point to a page without the matching highkey
+bound (_bt_moveright only rarely actually moves right).
+
+To implement this, we copy the downlink's right differentiator key into a
+temporary buffer, which is then compared against the child page's highkey.
+If they match, we reuse the compare result (plus prefix) we had for it from
+the parent page; if not, we need to do a full _bt_compare. Because memcpy +
+memcmp is cheap compared to _bt_compare, and because it's quite unlikely
+that we guess wrong, this speeds up our _bt_moveright code (at the cost of
+some stack memory in _bt_search and some overhead on a wrong prediction).
+
+Now that we have prefix bounds on the highest value of a page, the
+_bt_binsrch procedure will use this result as a rightmost prefix compare,
+and for each step in the binary search (that does not compare less than the
+insert key) improve the equal-prefix bounds.
+
+Using the above optimization, we now (on average) only need 2 full key
+compares per page, as opposed to ceil(log2(ntupsperpage)) + 1; a significant
+improvement.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f4c1a974ef..4c3bdefae2 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -326,6 +326,7 @@ _bt_search_insert(Relation rel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -344,7 +345,8 @@ _bt_search_insert(Relation rel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -438,7 +440,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = _bt_binsrch_insert(rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -448,6 +450,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -483,7 +487,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(_bt_compare(rel, itup_key, page, offset, &cmpcol) < 0);
break;
}
@@ -508,7 +512,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (_bt_compare(rel, itup_key, page, offset, &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -718,11 +722,12 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -865,6 +870,8 @@ _bt_findinsertloc(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Does the new tuple belong on this page?
*
@@ -882,7 +889,7 @@ _bt_findinsertloc(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, insertstate, stack);
@@ -935,6 +942,8 @@ _bt_findinsertloc(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
+
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -965,7 +974,7 @@ _bt_findinsertloc(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -980,10 +989,13 @@ _bt_findinsertloc(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -1002,7 +1014,7 @@ _bt_findinsertloc(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c43c1a2830..e3b828137b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +99,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
/* Get the root page to start with */
*bufP = _bt_getroot(rel, access);
@@ -130,7 +133,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* opportunity to finish splits of internal pages too.
*/
*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot);
+ page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -142,12 +146,15 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = _bt_binsrch(rel, key, *bufP);
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
child = BTreeTupleGetDownLink(itup);
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
/*
* We need to save the location of the pivot tuple we chose in a new
* stack entry for this page/level. If caller ends up splitting a
@@ -181,6 +188,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ highkeycmpcol = 1;
+
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -191,7 +200,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* move right to its new sibling. Do that.
*/
*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
+ snapshot, &highkeycmpcol, (char *) tupdatabuf);
}
return stack_in;
@@ -239,12 +248,16 @@ _bt_moveright(Relation rel,
bool forupdate,
BTStack stack,
int access,
- Snapshot snapshot)
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
+
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -266,12 +279,17 @@ _bt_moveright(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -297,14 +315,55 @@ _bt_moveright(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+ /*
+ * tupdatabuf is filled with the right separator of the parent node.
+ * This allows us to do a binary equality check between the parent
+ * node's right separator (which is < key) and this page's P_HIKEY.
+ * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+ * Without this, we'd on average use 3 full key compares per page before
+ * we achieve full dynamic prefix bounds, but with this optimization
+ * that is only 2.
+ *
+ * 3 compares: 1 for the highkey (rightmost), and on average 2 before
+ * we move right in the binary search on the page; this average equals
+ * SUM(1/2^x) for x from 0 to log2(n items), which tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ } else {
+ *comparecol = 1;
+ }
+ } else {
+ *comparecol = 1;
+ }
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
{
+ *comparecol = 1;
/* step right one page */
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -337,7 +396,8 @@ _bt_moveright(Relation rel,
static OffsetNumber
_bt_binsrch(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -345,6 +405,8 @@ _bt_binsrch(Relation rel,
high;
int32 result,
cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -386,16 +448,25 @@ _bt_binsrch(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
+
+ *highkeycmpcol = highcmpcol;
/*
* At this point we have high == low, but be careful: they could point
@@ -439,7 +510,8 @@ _bt_binsrch(Relation rel,
* list split).
*/
OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -449,6 +521,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -499,16 +572,22 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
+
if (result != 0)
stricthigh = high;
}
@@ -656,7 +735,8 @@ int32
_bt_compare(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -696,8 +776,9 @@ _bt_compare(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
{
Datum datum;
bool isNull;
@@ -741,11 +822,20 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = i;
return result;
+ }
scankey++;
}
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
@@ -876,6 +966,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ AttrNumber cmpcol = 1;
Assert(!BTScanPosIsValid(so->currPos));
@@ -1392,7 +1483,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = _bt_binsrch(rel, &inskey, buf, &cmpcol);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8f48960f9d..4cb24fa005 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1232,9 +1232,13 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
int access, Snapshot snapshot);
extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot, AttrNumber *comparecol,
+ char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
--
2.39.0
v8-0005-Add-an-attcacheoff-populating-function.patch
From 50ed073da7a6ba52c5e00e765e1959cf305f3f3b Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 12 Jan 2023 21:34:36 +0100
Subject: [PATCH v8 5/6] Add an attcacheoff-populating function
It populates attcacheoff-capable attributes with the correct offset,
and fills attributes whose offset is uncacheable with an 'uncacheable'
indicator value, as opposed to -1, which signals "unknown".
This allows users of the API to remove redundant cycles that try to
cache the offset of attributes - instead of O(N-attrs) operations, this
one only requires an O(1) check.
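As a quick illustration of the relative encoding described in the
function's header comment below (a sketch with invented helper names;
attribute numbers are treated as plain integers here, so the patch's
0/1-based conventions may differ slightly):

#include <assert.h>
#include <stdio.h>

/*
 * An uncacheable attribute stores a negative value that points back at
 * the last attribute whose offset is actually cached:
 *   attcacheoff = -1 - (thisattno - cachedattno)
 *   cachedattno = attcacheoff + 1 + thisattno
 */
static int
encode_uncacheable(int thisattno, int cachedattno)
{
    return -1 - (thisattno - cachedattno);
}

static int
decode_cachedattno(int attcacheoff, int thisattno)
{
    return attcacheoff + 1 + thisattno;
}

int
main(void)
{
    int     cachedattno = 2;    /* last attribute with a real offset */

    for (int attno = 3; attno < 6; attno++)
    {
        int     off = encode_uncacheable(attno, cachedattno);

        /* attno 3 -> -2, attno 4 -> -3, attno 5 -> -4 */
        printf("attno %d: attcacheoff %d\n", attno, off);
        assert(decode_cachedattno(off, attno) == cachedattno);
    }
    return 0;
}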
---
src/backend/access/common/tupdesc.c | 111 ++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 113 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 72a2c3d3db..f2d80ed0db 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -919,3 +919,114 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc, returning the last
+ * attcacheoff with a valid offset value.
+ *
+ * Populates attcacheoff with a negative cache value when no offset
+ * can be calculated (due to e.g. variable length attributes).
+ * The negative value is a value relative to the last cacheable attribute
+ * attcacheoff = -1 - (thisattno - cachedattno)
+ * so that the last attribute with cached offset can be found with
+ * cachedattno = attcacheoff + 1 + thisattno
+ *
+ * The value returned is the AttrNumber of the last (1-based) attribute that
+ * had its offset cached.
+ *
+ * When the TupleDesc has 0 attributes, it returns 0.
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber currAttNo, lastCachedAttNo;
+
+ if (numberOfAttributes == 0)
+ return 0;
+
+ /* Non-negative value: this attribute is cached */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff >= 0)
+ return (AttrNumber) desc->natts;
+ /*
+ * Attribute has been filled with relative offset to last cached value, but
+ * it itself is unreachable.
+ */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ return (AttrNumber) (TupleDescAttr(desc, desc->natts - 1)->attcacheoff + 1 + desc->natts);
+
+ /* last attribute of the tupledesc may or may not support attcacheoff */
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ currAttNo = 1;
+ /*
+ * Other code may have populated the value previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff >= 0)
+ currAttNo++;
+
+ /*
+ * Cache offset is undetermined. Start calculating offsets if possible.
+ *
+ * When we exit this block, currAttNo will point at the first uncacheable
+ * attribute, or past the end of the attribute array.
+ */
+ if (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, currAttNo - 1);
+ int32 off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, currAttNo);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ {
+ att->attcacheoff = off;
+ currAttNo++;
+ }
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ currAttNo++;
+ }
+ }
+ }
+
+ Assert(currAttNo == numberOfAttributes || (
+ currAttNo < numberOfAttributes
+ && TupleDescAttr(desc, (currAttNo - 1))->attcacheoff >= 0
+ && TupleDescAttr(desc, currAttNo)->attcacheoff == -1
+ ));
+ /*
+ * No cacheable offsets left. Fill the rest with negative cache values,
+ * but return the latest cached offset.
+ */
+ lastCachedAttNo = currAttNo;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ TupleDescAttr(desc, currAttNo)->attcacheoff = -1 - (currAttNo - lastCachedAttNo);
+ currAttNo++;
+ }
+
+ return lastCachedAttNo;
+}
\ No newline at end of file
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index b4286cf922..2673f2d0f3 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.39.0
v8-0004-Optimize-nbts_attiter-for-nkeyatts-1-btrees.patch
From a81ec04ab71763095f9a2218d76dfa95aa89f489 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 20:04:56 +0100
Subject: [PATCH v8 4/6] Optimize nbts_attiter for nkeyatts==1 btrees
This removes the index_getattr_nocache call path, which has significant overhead, and instead uses a constant offset of 0 for the single key attribute.
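For context, a minimal sketch (simplified struct and invented helper
name, not the patch's _bt_getfirstatt) of why the first attribute needs
no offset bookkeeping when there is no NULL bitmap: it always starts at
a fixed, MAXALIGN'd offset past the tuple header.

#include <stdint.h>
#include <stdio.h>

#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(len) \
    (((uintptr_t) (len) + (MAXIMUM_ALIGNOF - 1)) & ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

/* stand-in for IndexTupleData: 6-byte TID plus a 2-byte info word */
typedef struct MiniIndexTuple
{
    uint8_t     tid[6];
    uint16_t    info;
    /* attribute data follows, MAXALIGN'ed */
} MiniIndexTuple;

static inline const char *
first_attr_ptr(const MiniIndexTuple *itup)
{
    /* constant offset: no earlier attributes to walk, no cache to consult */
    return (const char *) itup + MAXALIGN(sizeof(MiniIndexTuple));
}

int
main(void)
{
    MiniIndexTuple tup = {{0}, 0};

    printf("first attribute at offset %zu\n",
           (size_t) (first_attr_ptr(&tup) - (const char *) &tup));
    return 0;
}

When a NULL bitmap is present, the patch's _bt_getfirstatt additionally
skips the MAXALIGN'd IndexAttributeBitMapData, as the hunk below shows.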
---
src/backend/access/nbtree/README | 1 +
src/backend/access/nbtree/nbtree_spec.c | 3 ++
src/include/access/nbtree.h | 35 +++++++++++++
src/include/access/nbtree_spec.h | 68 +++++++++++++++++++++++--
4 files changed, 102 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 4b11ea9ad7..6864902637 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1104,6 +1104,7 @@ in the index AM to call the specialized functions, increasing the
performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
+ - indexes with only a single key attribute
- multi-column indexes that could benefit from the attcacheoff optimization
NB: This is also the default path, and is comparatively slow for uncacheable
attribute offsets.
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 6b766581ab..21635397ed 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_SINGLE_KEYATT:
+ _bt_specialize_single_keyatt(rel);
+ break;
case NBTS_CTX_DEFAULT:
break;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f3f0961052..4628c41e9a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1123,6 +1123,7 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
typedef enum NBTS_CTX {
+ NBTS_CTX_SINGLE_KEYATT,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
@@ -1132,9 +1133,43 @@ static inline NBTS_CTX _nbt_spec_context(Relation irel)
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
+ if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ return NBTS_CTX_SINGLE_KEYATT;
+
return NBTS_CTX_CACHED;
}
+static inline Datum _bt_getfirstatt(IndexTuple tuple, TupleDesc tupleDesc,
+ bool *isNull)
+{
+ Datum result;
+ if (IndexTupleHasNulls(tuple))
+ {
+ if (att_isnull(0, (bits8 *)(tuple) + sizeof(IndexTupleData)))
+ {
+ *isNull = true;
+ result = (Datum) 0;
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)
+ + sizeof(IndexAttributeBitMapData)));
+ }
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)));
+ }
+
+ return result;
+}
+
#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
#include "nbtree_spec.h"
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index 6ddba4d520..3ad64aad39 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -43,6 +43,7 @@
/*
* Macros used in the nbtree specialization code.
*/
+#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -50,8 +51,10 @@
/* contextual specializations */
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
)
@@ -72,9 +75,9 @@ do { \
} while (false)
/*
- * Call a potentially specialized function for a given btree operation.
- *
- * NB: the rel argument is evaluated multiple times.
+ * Protections against multiple inclusion - these macros are defined
+ * differently for files included via the templating mechanism vs. the
+ * users of this template, so they are redefined at top and bottom.
*/
#ifdef NBTS_FUNCTION
#undef NBTS_FUNCTION
@@ -164,6 +167,61 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Specialization 3: SINGLE_KEYATT
+ *
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path cannot be used for indexes with multiple key
+ * columns, because it never considers the next column.
+ */
+
+/* this templated context is already specialized, so no runtime context is needed */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel)
+
+#define NBTS_SPECIALIZING_SINGLE_KEYATT
+#define NBTS_TYPE NBTS_TYPE_SINGLE_KEYATT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert((endAttNum) == 1); ((void) (endAttNum)); \
+ if ((initAttNum) == 1) for (int spec_i = 0; spec_i < 1; spec_i++)
+
+#define nbts_attiter_attnum 1
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+( \
+ AssertMacro(spec_i == 0), \
+ _bt_getfirstatt(itup, tupDesc, &NBTS_MAKE_NAME(itup, isNull)) \
+)
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_KEYATT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * All subsequent contexts are from non-templated code, so
+ * they need to actually include the context.
+ */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
/*
* from here on all NBTS_FUNCTIONs are from specialized function names that
* are being called. Change the result of those macros from a direct call
--
2.39.0
v8-0003-Use-specialized-attribute-iterators-in-the-specia.patch
From 5f95f55469291c188049c5042982ab4cdcb0ca91 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:57:21 +0100
Subject: [PATCH v8 3/6] Use specialized attribute iterators in the specialized
source files
This is committed separately to make clear what substantial changes were
made to the pre-existing code.
Even though not all nbt*_spec functions have been updated, these functions
can now directly call (and inline, and optimize for) the specialized functions
they invoke, instead of having to determine the right specialization based on
the (potentially locally unavailable) index relation; this keeps the
duplication of those functions worthwhile.
---
src/backend/access/nbtree/nbtsearch_spec.c | 18 +++---
src/backend/access/nbtree/nbtsort_spec.c | 24 +++----
src/backend/access/nbtree/nbtutils_spec.c | 62 ++++++++++++-------
.../utils/sort/tuplesortvariants_spec.c | 54 +++++++++-------
src/include/access/nbtree_spec.h | 10 +--
5 files changed, 95 insertions(+), 73 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
index 37cc3647d3..4ce39e7724 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.c
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -632,6 +632,7 @@ _bt_compare(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -664,23 +665,26 @@ _bt_compare(Relation rel,
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, *comparecol, itupdesc);
+
+ nbts_foreachattr(*comparecol, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -709,7 +713,7 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
{
- *comparecol = i;
+ *comparecol = nbts_attiter_attnum;
return result;
}
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
index 368d6f244c..6f33cc4cc2 100644
--- a/src/backend/access/nbtree/nbtsort_spec.c
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -34,8 +34,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -57,7 +56,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -90,21 +89,24 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else if (itup != NULL)
{
int32 compare = 0;
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
index 0288da22d6..07ca18f404 100644
--- a/src/backend/access/nbtree/nbtutils_spec.c
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -64,7 +64,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
- int i;
+ nbts_attiterdeclare(itup);
itupdesc = RelationGetDescr(rel);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -95,7 +95,10 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -106,27 +109,30 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -675,6 +681,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
@@ -686,20 +694,22 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -707,6 +717,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -747,24 +758,27 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
index 0791f41136..61c4826853 100644
--- a/src/backend/utils/sort/tuplesortvariants_spec.c
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -40,11 +40,8 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
bool equal_hasnull = false;
int nkey;
int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
+ nbts_attiterdeclare(tuple1);
+ nbts_attiterdeclare(tuple2);
/* Compare the leading sort key */
compare = ApplySortComparator(a->datum1, a->isnull1,
@@ -59,37 +56,46 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
keysz = base->nKeys;
tupDes = RelationGetDescr(arg->index.indexRel);
- if (sortKey->abbrev_converter)
+ if (!sortKey->abbrev_converter)
{
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
+ nkey = 2;
+ sortKey++;
+ }
+ else
+ {
+ nkey = 1;
}
/* they are equal, so we only need to examine one null flag */
if (a->isnull1)
equal_hasnull = true;
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ nbts_attiterinit(tuple1, nkey, tupDes);
+ nbts_attiterinit(tuple2, nkey, tupDes);
+
+ nbts_foreachattr(nkey, keysz)
{
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+ Datum datum1,
+ datum2;
+ datum1 = nbts_attiter_nextattdatum(tuple1, tupDes);
+ datum2 = nbts_attiter_nextattdatum(tuple2, tupDes);
+
+ if (nbts_attiter_attnum == 1)
+ compare = ApplySortAbbrevFullComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ else
+ compare = ApplySortComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
if (compare != 0)
- return compare; /* done when we find unequal attributes */
+ return compare;
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
+ if (nbts_attiter_curattisnull(tuple1))
equal_hasnull = true;
+
+ sortKey++;
}
/*
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index 0bfb623f37..6ddba4d520 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -66,15 +66,11 @@ do { \
if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
{ \
nbts_prep_ctx(rel); \
+ Assert(PointerIsValid(rel)); \
_bt_specialize(rel); \
} \
} while (false)
-/*
- * Access a specialized nbtree function, based on the shape of the index key.
- */
-#define NBTS_DEFINITIONS
-
/*
* Call a potentially specialized function for a given btree operation.
*
@@ -102,7 +98,7 @@ do { \
#define nbts_attiterdeclare(itup) \
bool NBTS_MAKE_NAME(itup, isNull)
-#define nbts_attiterinit(itup, initAttNum, tupDesc)
+#define nbts_attiterinit(itup, initAttNum, tupDesc) do {} while (false)
#define nbts_foreachattr(initAttNum, endAttNum) \
for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
@@ -132,7 +128,7 @@ do { \
* "Default", externally accessible, not so much optimized functions
*/
-/* the default context (and later contexts) do need to specialize, so here's that */
+/* the default context needs to specialize, so here's that */
#undef nbts_prep_ctx
#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
--
2.39.0
v8-0002-Specialize-nbtree-functions-on-btree-key-shape.patch
From b067f08567da4e1f6ad628e1fef3c058370396d2 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:13:04 +0100
Subject: [PATCH v8 2/6] Specialize nbtree functions on btree key shape.
nbtree keys are not all made the same, so a significant amount of time is
spent on code that exists only to deal with other key shapes. By specializing
function calls based on the key shape, we can remove or reduce these causes
of overhead.
This commit adds the basic infrastructure for specializing specific hot code
in the nbtree AM to certain shapes of keys, and splits the code that can
benefit from attribute offset optimizations into separate files. This does
NOT yet update the code itself - it just makes the code compile cleanly.
The performance should be comparable if not the same.
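As a rough, single-file sketch of the templating idea (macro names here
are invented; the real infrastructure uses NBTS_MAKE_NAME, the
nbts_attiter_* macros and #include NBT_SPECIALIZE_FILE): the same
function body is compiled once per key shape, each time with a different
attribute-access macro and a shape-specific name suffix.

#include <stdio.h>

#define PASTE2(a, b)            a##_##b
#define MAKE_NAME(name, shape)  PASTE2(name, shape)

/*
 * "Template" body; in the patchset this lives in a separate file that is
 * included once per specialization, with different macros defined.
 */
#define DEFINE_SUM_FOR_SHAPE(shape, GETATTR) \
static int \
MAKE_NAME(sum_key_atts, shape)(const int *tuple, int natts) \
{ \
    int     total = 0; \
    for (int i = 0; i < natts; i++) \
        total += GETATTR(tuple, i); \
    return total; \
}

/* one attribute-access strategy per key shape */
#define GETATTR_SINGLE(tup, i)  ((tup)[0])  /* exactly one key column */
#define GETATTR_CACHED(tup, i)  ((tup)[i])  /* cached/constant offsets */

DEFINE_SUM_FOR_SHAPE(single_keyatt, GETATTR_SINGLE)
DEFINE_SUM_FOR_SHAPE(cached, GETATTR_CACHED)

int
main(void)
{
    int     tuple[3] = {4, 5, 6};

    /* the caller picks the specialization that matches the index shape */
    printf("%d\n", sum_key_atts_single_keyatt(tuple, 1));  /* 4 */
    printf("%d\n", sum_key_atts_cached(tuple, 3));          /* 15 */
    return 0;
}

The patchset applies the same trick to whole source files (the *_spec.c
files included via NBT_SPECIALIZE_FILE) and then dispatches to the right
copy at the index AM entry points.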
---
contrib/amcheck/verify_nbtree.c | 7 +
src/backend/access/nbtree/README | 33 +-
src/backend/access/nbtree/nbtdedup.c | 300 +----
src/backend/access/nbtree/nbtdedup_spec.c | 317 +++++
src/backend/access/nbtree/nbtinsert.c | 566 +--------
src/backend/access/nbtree/nbtinsert_spec.c | 583 +++++++++
src/backend/access/nbtree/nbtpage.c | 1 +
src/backend/access/nbtree/nbtree.c | 37 +-
src/backend/access/nbtree/nbtree_spec.c | 69 ++
src/backend/access/nbtree/nbtsearch.c | 1075 +---------------
src/backend/access/nbtree/nbtsearch_spec.c | 1087 +++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 264 +---
src/backend/access/nbtree/nbtsort_spec.c | 280 +++++
src/backend/access/nbtree/nbtsplitloc.c | 3 +
src/backend/access/nbtree/nbtutils.c | 754 +-----------
src/backend/access/nbtree/nbtutils_spec.c | 775 ++++++++++++
src/backend/utils/sort/tuplesortvariants.c | 144 +--
.../utils/sort/tuplesortvariants_spec.c | 158 +++
src/include/access/nbtree.h | 45 +-
src/include/access/nbtree_spec.h | 180 +++
src/include/access/nbtree_specfuncs.h | 66 +
21 files changed, 3609 insertions(+), 3135 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.c
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.c
create mode 100644 src/backend/access/nbtree/nbtree_spec.c
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.c
create mode 100644 src/backend/access/nbtree/nbtsort_spec.c
create mode 100644 src/backend/access/nbtree/nbtutils_spec.c
create mode 100644 src/backend/utils/sort/tuplesortvariants_spec.c
create mode 100644 src/include/access/nbtree_spec.h
create mode 100644 src/include/access/nbtree_specfuncs.h
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 22bb229820..fb89d6ada2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2680,6 +2680,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTStack stack;
Buffer lbuf;
bool exists;
+ nbts_prep_ctx(NULL);
key = _bt_mkscankey(state->rel, itup);
Assert(key->heapkeyspace && key->scantid != NULL);
@@ -2780,6 +2781,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2843,6 +2845,7 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2867,6 +2870,7 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2906,6 +2910,7 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2966,6 +2971,7 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
Page page;
BTPageOpaque opaque;
OffsetNumber maxoffset;
+ nbts_prep_ctx(NULL);
page = palloc(BLCKSZ);
@@ -3141,6 +3147,7 @@ static inline BTScanInsert
bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
+ nbts_prep_ctx(NULL);
skey = _bt_mkscankey(rel, itup);
skey->pivotsearch = true;
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 4d7fa5aff4..4b11ea9ad7 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -940,8 +940,9 @@ and for each step in the binary search (that does not compare less than the
insert key) improve the equal-prefix bounds.
Using the above optimization, we now (on average) only need 2 full key
-compares per page, as opposed to ceil(log2(ntupsperpage)) + 1; a significant
-improvement.
+compares per page (plus ceil(log2(ntupsperpage)) single-attribute compares),
+as opposed to the ceil(log2(ntupsperpage)) + 1 of a naive implementation;
+a significant improvement.
Notes about deduplication
-------------------------
@@ -1083,6 +1084,34 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+Notes about nbtree specialization
+---------------------------------
+
+Attribute iteration is a significant overhead for multi-column indexes
+with variable length attributes, due to our inability to cache the offset
+of each attribute into an on-disk tuple. To combat this, we'd have to either
+fully deserialize the tuple, or maintain our offset into the tuple as we
+iterate over the tuple's fields.
+
+Keeping track of this offset has a non-negligible overhead too, so we'd
+prefer to not have to keep track of these offsets when we can use the cache.
+By specializing performance-sensitive search functions for these specific
+index tuple shapes and calling those selectively, we can keep the performance
+of cacheable attribute offsets where that is applicable, while improving
+performance where we currently would see O(n_atts^2) time iterating on
+variable-length attributes. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - multi-column indexes that could benefit from the attcacheoff optimization
+ NB: This is also the default path, and is comparatively slow for uncacheable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 0349988cf5..4589ade267 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,260 +22,14 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple and add it to our temp page (newpage).
- * Else add pending interval's base tuple to the temp page as-is.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place, when this page's newly allocated right sibling page
- * gets its first deduplication pass).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.c"
+#include "access/nbtree_spec.h"
/*
* Perform bottom-up index deletion pass.
@@ -316,6 +70,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
TM_IndexDeleteOp delstate;
bool neverdedup;
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ nbts_prep_ctx(rel);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
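nbts_prep_ctx(rel), added to the generic functions that remain here, is the
hook that lets them reach the specialized code. One plausible shape for it
-- again only an illustration, with invented helper names -- is a local
classification of the index's key layout that a dispatching macro consults:

/* illustrative sketch only */
#define nbts_prep_ctx(rel) \
	int nbts_ctx = nbts_classify(rel)	/* e.g. cached vs. uncached */

#define nbts_call(func, rel, ...) \
	((nbts_ctx) == NBTS_CTX_CACHED ? \
	 func##_cached(rel, __VA_ARGS__) : \
	 func##_uncached(rel, __VA_ARGS__))

A generic caller then runs nbts_prep_ctx(rel) once per function, and each
subsequent specialized call resolves to the variant matching the index
without re-deriving the key shape every time.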
@@ -752,55 +507,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
diff --git a/src/backend/access/nbtree/nbtdedup_spec.c b/src/backend/access/nbtree/nbtdedup_spec.c
new file mode 100644
index 0000000000..584211fe66
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.c
@@ -0,0 +1,317 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup_spec.c
+ * Index shape-specialized functions for nbtdedup.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtdedup_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_do_singleval NBTS_FUNCTION(_bt_do_singleval)
+
+static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
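+/*
+ * Note that because the #define above renames _bt_do_singleval before its
+ * prototype and definition are read, each inclusion of this file (one per
+ * key shape) yields a distinctly named static function instead of
+ * conflicting duplicate definitions in the including translation unit.
+ */
+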
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
+ Size newitemsz, bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
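+ /* BTMaxItemSize() is about a third of the page, hence /2 for one sixth */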
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple and add it to our temp page (newpage).
+ * Else add pending interval's base tuple to the temp page as-is.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place, when this page's newly allocated right sibling page
+ * gets its first deduplication pass).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis of the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+_bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4c3bdefae2..ca8ea60ffb 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,17 +30,10 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
@@ -73,313 +66,8 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
- AttrNumber cmpcol = 1;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
- &cmpcol) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
-
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
- NULL);
-}
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.c"
+#include "access/nbtree_spec.h"
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -423,6 +111,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
bool inposting = false;
bool prevalldead = true;
int curposti = 0;
+ nbts_prep_ctx(rel);
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -774,253 +463,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- {
- AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
- }
-
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1501,6 +943,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
bool newitemonleft,
isleaf,
isrightmost;
+ nbts_prep_ctx(rel);
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -2693,6 +2136,7 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(buffer);
BTPageOpaque opaque = BTPageGetOpaque(page);
+ nbts_prep_ctx(rel);
Assert(P_ISLEAF(opaque));
Assert(simpleonly || itup_key->heapkeyspace);
diff --git a/src/backend/access/nbtree/nbtinsert_spec.c b/src/backend/access/nbtree/nbtinsert_spec.c
new file mode 100644
index 0000000000..d37afae5ae
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.c
@@ -0,0 +1,583 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtinsert_spec.c
+ * Index shape-specialized functions for nbtinsert.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtinsert_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_search_insert NBTS_FUNCTION(_bt_search_insert)
+#define _bt_findinsertloc NBTS_FUNCTION(_bt_findinsertloc)
+
+static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
+static OffsetNumber _bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+_bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = _bt_mkscankey(rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = _bt_search_insert(rel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+ itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+_bt_search_insert(Relation rel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
+ NULL);
+}
+
+
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+_bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ {
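+ /* Scoped block: cmpcol exists only for the assertion below */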
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
+
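+ /* No prefix bound is known at this point, so comparisons start at key column 1 */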
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 3feee28d19..7710226f41 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1819,6 +1819,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
bool rightsib_empty;
Page page;
BTPageOpaque opaque;
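+ /* Set up the key-shape specialization context used by calls in this function */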
+ nbts_prep_ctx(rel);
/*
* Save original leafbuf block number from caller. Only deleted blocks
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1cc88da032..ceeafd637f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,8 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
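+/* Build the key-shape specialized variants of the functions in nbtree_spec.c */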
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.c"
+#include "access/nbtree_spec.h"
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -120,7 +122,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambuild = btbuild;
amroutine->ambuildempty = btbuildempty;
- amroutine->aminsert = btinsert;
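+ /* may later be replaced by a key-shape specialized variant; see _bt_specialize() */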
+ amroutine->aminsert = btinsert_default;
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
@@ -152,6 +154,8 @@ btbuildempty(Relation index)
{
Page metapage;
+ nbt_opt_specialize(index);
+
/* Construct metapage. */
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
@@ -177,33 +181,6 @@ btbuildempty(Relation index)
smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
@@ -345,6 +322,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -788,6 +767,8 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(rel);
+
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
new file mode 100644
index 0000000000..6b766581ab
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.c
+ * Index shape-specialized functions for nbtree.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtree_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
+void
+_bt_specialize(Relation rel)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ NBTS_MAKE_CTX(rel);
+ /*
+ * We can't reference _bt_specialize directly here because the name would
+ * be macro-expanded, nor can we use NBTS_SPECIALIZE_NAME, because that
+ * would call back into _bt_specialize and recurse infinitely.
+ */
+ switch (__nbts_ctx)
+ {
+ case NBTS_CTX_CACHED:
+ _bt_specialize_cached(rel);
+ break;
+ case NBTS_CTX_DEFAULT:
+ break;
+ }
+#else
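+ /* install this specialization's btinsert as the relation's insert routine */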
+ rel->rd_indam->aminsert = btinsert;
+#endif
+}
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+ bool result;
+ IndexTuple itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e3b828137b..0089fe7eeb 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,12 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
- AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -47,6 +43,8 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.c"
+#include "access/nbtree_spec.h"
/*
* _bt_drop_lock_and_maybe_pin()
@@ -71,572 +69,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- */
-BTStack
-_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
- Snapshot snapshot)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
- char tupdatabuf[BLCKSZ / 3];
- AttrNumber highkeycmpcol = 1;
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot, &highkeycmpcol,
- (char *) tupdatabuf);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
- memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- highkeycmpcol = 1;
-
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot, &highkeycmpcol, (char *) tupdatabuf);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- */
-Buffer
-_bt_moveright(Relation rel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- Snapshot snapshot,
- AttrNumber *comparecol,
- char *tupdatabuf)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- page = BufferGetPage(buf);
- TestForOldSnapshot(snapshot, rel, page);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- {
- *comparecol = 1;
- break;
- }
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- /*
- * tupdatabuf is filled with the right seperator of the parent node.
- * This allows us to do a binary equality check between the parent
- * node's right seperator (which is < key) and this page's P_HIKEY.
- * If they equal, we can reuse the result of the parent node's
- * rightkey compare, which means we can potentially save a full key
- * compare (which includes indirect calls to attribute comparison
- * functions).
- *
- * Without this, we'd on average use 3 full key compares per page before
- * we achieve full dynamic prefix bounds, but with this optimization
- * that is only 2.
- *
- * 3 compares: 1 for the highkey (rightmost), and on average 2 before
- * we move right in the binary search on the page, this average equals
- * SUM (1/2 ^ x) for x from 0 to log(n items)), which tends to 2.
- */
- if (!P_IGNORE(opaque) && *comparecol > 1)
- {
- IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
- IndexTuple buftuple = (IndexTuple) tupdatabuf;
- if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
- {
- char *dataptr = (char *) itup;
-
- if (memcmp(dataptr + sizeof(IndexTupleData),
- tupdatabuf + sizeof(IndexTupleData),
- IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
- break;
- } else {
- *comparecol = 1;
- }
- } else {
- *comparecol = 1;
- }
-
- if (P_IGNORE(opaque) ||
- _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
- {
- *comparecol = 1;
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- {
- *comparecol = cmpcol;
- break;
- }
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf,
- AttrNumber *highkeycmpcol)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
- AttrNumber highcmpcol = *highkeycmpcol,
- lowcmpcol = 1;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
- }
- }
-
- *highkeycmpcol = highcmpcol;
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
- AttrNumber lowcmpcol = 1;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
-
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
-
/*----------
* _bt_binsrch_posting() -- posting list binary search.
*
@@ -704,228 +136,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum,
- AttrNumber *comparecol)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
-
- scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- {
- *comparecol = i;
- return result;
- }
-
- scankey++;
- }
-
- /*
- * All tuple attributes are equal to the scan key, only later attributes
- * could potentially not equal the scan key.
- */
- *comparecol = ntupatts + 1;
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
-
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -967,6 +177,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
BTScanPosItem *currItem;
BlockNumber blkno;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(rel);
Assert(!BTScanPosIsValid(so->currPos));
@@ -1589,280 +800,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2071,12 +1008,11 @@ static bool
_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Relation rel;
+ Relation rel = scan->indexRelation;
Page page;
BTPageOpaque opaque;
bool status;
-
- rel = scan->indexRelation;
+ nbts_prep_ctx(rel);
if (ScanDirectionIsForward(dir))
{
@@ -2488,6 +1424,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
BTPageOpaque opaque;
OffsetNumber start;
BTScanPosItem *currItem;
+ nbts_prep_ctx(rel);
/*
* Scan down to the leftmost or rightmost leaf page. This is a simplified
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
new file mode 100644
index 0000000000..37cc3647d3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -0,0 +1,1087 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsearch_spec.c
+ * Index shape-specialized functions for nbtsearch.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsearch_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
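+/* Give this file's static functions specialization-specific names */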
+#define _bt_binsrch NBTS_FUNCTION(_bt_binsrch)
+#define _bt_readpage NBTS_FUNCTION(_bt_readpage)
+
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
+static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ */
+BTStack
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+ Snapshot snapshot)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
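+ /* holds the chosen downlink tuple for _bt_moveright's high-key equality check */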
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
+ page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is lowest non-leaf page level prior to leaves. So, if
+ * we're on the level 1 and asked to lock leaf page in write mode,
+ * then lock next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ highkeycmpcol = 1;
+
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+ snapshot, &highkeycmpcol, (char *) tupdatabuf);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split
+ * is completed. 'stack' is only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ */
+Buffer
+_bt_moveright(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
+ break;
+ }
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ /*
+ * tupdatabuf is filled with the right separator of the parent node.
+ * This allows us to do a binary equality check between the parent
+ * node's right separator (which is < key) and this page's P_HIKEY.
+ * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+ * Without this, we'd on average use 3 full key compares per page before
+ * we achieve full dynamic prefix bounds, but with this optimization
+ * that is only 2.
+ *
+ * 3 compares: 1 for the high key (rightmost), and on average 2 before
+ * we move right in the binary search on the page; this average equals
+ * SUM(1/2^x) for x from 0 to log(n items), which tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ } else {
+ *comparecol = 1;
+ }
+ } else {
+ *comparecol = 1;
+ }
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
+ {
+ *comparecol = 1;
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ {
+ *comparecol = cmpcol;
+ break;
+ }
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
+
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
+ */
+static OffsetNumber
+_bt_binsrch(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+ }
+ }
+
+ *highkeycmpcol = highcmpcol;
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
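+
+/*
+ * Minimal, self-contained sketch of the "dynamic prefix bounds"
+ * bookkeeping used above (hypothetical types and names, not part of the
+ * patch). Both probes that currently bound the search window were found
+ * to agree with the search key on some number of leading columns; any
+ * tuple between them in sort order must agree on the shared prefix too,
+ * so the next comparison may start at the first column that is not yet
+ * known to be equal.
+ */
+#include <stdio.h>
+
+#define NKEYCOLS 3
+
+/* compare tuple against key, starting at column *cmpcol (1-based) */
+static int
+tuple_compare(const int *tuple, const int *key, int *cmpcol)
+{
+ for (int col = *cmpcol; col <= NKEYCOLS; col++)
+ {
+ if (tuple[col - 1] != key[col - 1])
+ {
+ *cmpcol = col; /* first column that differs */
+ return (tuple[col - 1] < key[col - 1]) ? -1 : 1;
+ }
+ }
+ *cmpcol = NKEYCOLS + 1; /* all key columns equal */
+ return 0;
+}
+
+/* return index of the first tuple >= key; tuples[] is sorted */
+static int
+binsrch_prefix(int (*tuples)[NKEYCOLS], int ntuples, const int *key)
+{
+ int low = 0,
+ high = ntuples;
+ int lowcmpcol = 1,
+ highcmpcol = 1;
+
+ while (high > low)
+ {
+ int mid = low + (high - low) / 2;
+ /* skip columns that both bounds already agree on */
+ int cmpcol = (lowcmpcol < highcmpcol) ? lowcmpcol : highcmpcol;
+ int result = tuple_compare(tuples[mid], key, &cmpcol);
+
+ if (result < 0)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+ }
+ }
+ return low;
+}
+
+int
+main(void)
+{
+ int tuples[4][NKEYCOLS] = {{1, 1, 1}, {1, 1, 2}, {1, 2, 0}, {2, 0, 0}};
+ int key[NKEYCOLS] = {1, 2, 0};
+
+ printf("first tuple >= key is at index %d\n",
+ binsrch_prefix(tuples, 4, key));
+ return 0;
+}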
+
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+ AttrNumber lowcmpcol = 1;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
+
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+_bt_compare(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ {
+ *comparecol = i;
+ return result;
+ }
+
+ scankey++;
+ }
+
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
+
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* allow next page to be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it is
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 67b7b1710c..af408f704f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,8 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.c"
+#include "access/nbtree_spec.h"
/*
* btbuild() -- build a new btree index.
@@ -544,6 +544,7 @@ static void
_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
{
BTWriteState wstate;
+ nbts_prep_ctx(btspool->index);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
@@ -844,6 +845,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
Size pgspc;
Size itupsz;
bool isleaf;
+ nbts_prep_ctx(wstate->index);
/*
* This is a handy place to check for cancel interrupts during the btree
@@ -1176,264 +1178,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- Assert(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
new file mode 100644
index 0000000000..368d6f244c
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsort_spec.c
+ * Index shape-specialized functions for nbtsort.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsort_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_load NBTS_FUNCTION(_bt_load)
+
+static void _bt_load(BTWriteState *wstate,
+ BTSpool *btspool, BTSpool *btspool2);
+
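+/*
+ * Minimal, self-contained sketch of the name-mangling scheme that
+ * NBT_SPECIALIZE_FILE / NBTS_FUNCTION rely on. The real patch appears to
+ * compile the *_spec.c body once per key shape via access/nbtree_spec.h;
+ * that repeated inclusion is emulated below with a macro so the sketch
+ * fits in one snippet, and every SPEC_* name is a hypothetical stand-in.
+ * Each expansion uses a different suffix and attribute accessor, so every
+ * specialization gets its own symbol.
+ */
+#include <stdio.h>
+
+#define SPEC_NAME_(name, suffix) name##_##suffix
+#define SPEC_NAME(name, suffix) SPEC_NAME_(name, suffix)
+#define SPEC_FUNCTION(name) SPEC_NAME(name, SPEC_SUFFIX)
+
+/* stand-in for the contents of a *_spec.c file */
+#define SPEC_BODY \
+static int \
+SPEC_FUNCTION(get_key_attr)(const int *tuple, int attno) \
+{ \
+ return SPEC_ATTR_FETCH(tuple, attno); \
+}
+
+/* "cached offsets" shape: attribute sits at a known position */
+#define SPEC_SUFFIX cached
+#define SPEC_ATTR_FETCH(tuple, attno) ((tuple)[(attno) - 1])
+SPEC_BODY
+#undef SPEC_ATTR_FETCH
+#undef SPEC_SUFFIX
+
+/* "uncached" shape: attribute must be fetched through a helper */
+static int
+fetch_attr_slow(const int *tuple, int attno)
+{
+ return tuple[attno - 1]; /* pretend this walks the tuple */
+}
+
+#define SPEC_SUFFIX uncached
+#define SPEC_ATTR_FETCH(tuple, attno) fetch_attr_slow(tuple, attno)
+SPEC_BODY
+#undef SPEC_ATTR_FETCH
+#undef SPEC_SUFFIX
+
+int
+main(void)
+{
+ int tuple[2] = {42, 7};
+
+ printf("%d %d\n",
+ get_key_attr_cached(tuple, 1),
+ get_key_attr_uncached(tuple, 2));
+ return 0;
+}
+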
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ Assert(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index ecb49bb471..991118fd50 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -639,6 +639,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
ItemId itemid;
IndexTuple tup;
int keepnatts;
+ nbts_prep_ctx(state->rel);
Assert(state->is_leaf && !state->is_rightmost);
@@ -945,6 +946,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
*rightinterval;
int perfectpenalty;
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+ nbts_prep_ctx(state->rel);
/* Assume that alternative strategy won't be used for now */
*strategy = SPLIT_DEFAULT;
@@ -1137,6 +1139,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
{
IndexTuple lastleft;
IndexTuple firstright;
+ nbts_prep_ctx(state->rel);
if (!state->is_leaf)
{
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 8003583c0a..85f92adda8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,10 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.c"
+#include "access/nbtree_spec.h"
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1220,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2173,286 +1703,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
new file mode 100644
index 0000000000..0288da22d6
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -0,0 +1,775 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtutils_spec.c
+ * Index shape-specialized functions for nbtutils.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtutils_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_check_rowcompare NBTS_FUNCTION(_bt_check_rowcompare)
+#define _bt_keep_natts NBTS_FUNCTION(_bt_keep_natts)
+
+static bool _bt_check_rowcompare(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+ continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+ TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb6cfcfd00..12f909e1cf 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -57,8 +57,6 @@ static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
SortTuple *stup);
static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
LogicalTape *tape, unsigned int tuplen);
-static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
@@ -130,6 +128,9 @@ typedef struct
int datumTypeLen;
} TuplesortDatumArg;
+#define NBT_SPECIALIZE_FILE "../../backend/utils/sort/tuplesortvariants_spec.c"
+#include "access/nbtree_spec.h"
+
Tuplesortstate *
tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
@@ -217,6 +218,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
MemoryContext oldcontext;
TuplesortClusterArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
@@ -328,6 +330,7 @@ tuplesort_begin_index_btree(Relation heapRel,
TuplesortIndexBTreeArg *arg;
MemoryContext oldcontext;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -461,6 +464,7 @@ tuplesort_begin_index_gist(Relation heapRel,
MemoryContext oldcontext;
TuplesortIndexBTreeArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -1259,142 +1263,6 @@ removeabbrev_index(Tuplesortstate *state, SortTuple *stups, int count)
}
}
-static int
-comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- /*
- * This is similar to comparetup_heap(), but expects index tuples. There
- * is also special handling for enforcing uniqueness, and special
- * treatment for equal keys at the end.
- */
- TuplesortPublic *base = TuplesortstateGetPublic(state);
- TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
- SortSupport sortKey = base->sortKeys;
- IndexTuple tuple1;
- IndexTuple tuple2;
- int keysz;
- TupleDesc tupDes;
- bool equal_hasnull = false;
- int nkey;
- int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
-
- /* Compare the leading sort key */
- compare = ApplySortComparator(a->datum1, a->isnull1,
- b->datum1, b->isnull1,
- sortKey);
- if (compare != 0)
- return compare;
-
- /* Compare additional sort keys */
- tuple1 = (IndexTuple) a->tuple;
- tuple2 = (IndexTuple) b->tuple;
- keysz = base->nKeys;
- tupDes = RelationGetDescr(arg->index.indexRel);
-
- if (sortKey->abbrev_converter)
- {
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
- }
-
- /* they are equal, so we only need to examine one null flag */
- if (a->isnull1)
- equal_hasnull = true;
-
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
- {
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
-
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare; /* done when we find unequal attributes */
-
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
- equal_hasnull = true;
- }
-
- /*
- * If btree has asked us to enforce uniqueness, complain if two equal
- * tuples are detected (unless there was at least one NULL field and NULLS
- * NOT DISTINCT was not set).
- *
- * It is sufficient to make the test here, because if two tuples are equal
- * they *must* get compared at some stage of the sort --- otherwise the
- * sort algorithm wouldn't have checked whether one must appear before the
- * other.
- */
- if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- /*
- * Some rather brain-dead implementations of qsort (such as the one in
- * QNX 4) will sometimes call the comparison routine to compare a
- * value to itself, but we always use our own implementation, which
- * does not.
- */
- Assert(tuple1 != tuple2);
-
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is required for
- * btree indexes, since heap TID is treated as an implicit last key
- * attribute in order to ensure that all keys in the index are physically
- * unique.
- */
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
- /* ItemPointer values should never be equal */
- Assert(false);
-
- return 0;
-}
-
static int
comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state)
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
new file mode 100644
index 0000000000..0791f41136
--- /dev/null
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -0,0 +1,158 @@
+/*-------------------------------------------------------------------------
+ *
+ * tuplesortvariants_spec.c
+ * Index shape-specialized functions for tuplesortvariants.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/sort/tuplesortvariants_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define comparetup_index_btree NBTS_FUNCTION(comparetup_index_btree)
+
+static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+
+static int
+comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ /*
+ * This is similar to comparetup_heap(), but expects index tuples. There
+ * is also special handling for enforcing uniqueness, and special
+ * treatment for equal keys at the end.
+ */
+ TuplesortPublic *base = TuplesortstateGetPublic(state);
+ TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
+ SortSupport sortKey = base->sortKeys;
+ IndexTuple tuple1;
+ IndexTuple tuple2;
+ int keysz;
+ TupleDesc tupDes;
+ bool equal_hasnull = false;
+ int nkey;
+ int32 compare;
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+
+ /* Compare the leading sort key */
+ compare = ApplySortComparator(a->datum1, a->isnull1,
+ b->datum1, b->isnull1,
+ sortKey);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ tuple1 = (IndexTuple) a->tuple;
+ tuple2 = (IndexTuple) b->tuple;
+ keysz = base->nKeys;
+ tupDes = RelationGetDescr(arg->index.indexRel);
+
+ if (sortKey->abbrev_converter)
+ {
+ datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
+
+ compare = ApplySortAbbrevFullComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare;
+ }
+
+ /* they are equal, so we only need to examine one null flag */
+ if (a->isnull1)
+ equal_hasnull = true;
+
+ sortKey++;
+ for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ {
+ datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+
+ compare = ApplySortComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare; /* done when we find unequal attributes */
+
+ /* they are equal, so we only need to examine one null flag */
+ if (isnull1)
+ equal_hasnull = true;
+ }
+
+ /*
+ * If btree has asked us to enforce uniqueness, complain if two equal
+ * tuples are detected (unless there was at least one NULL field and NULLS
+ * NOT DISTINCT was not set).
+ *
+ * It is sufficient to make the test here, because if two tuples are equal
+ * they *must* get compared at some stage of the sort --- otherwise the
+ * sort algorithm wouldn't have checked whether one must appear before the
+ * other.
+ */
+ if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ /*
+ * Some rather brain-dead implementations of qsort (such as the one in
+ * QNX 4) will sometimes call the comparison routine to compare a
+ * value to itself, but we always use our own implementation, which
+ * does not.
+ */
+ Assert(tuple1 != tuple2);
+
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is required for
+ * btree indexes, since heap TID is treated as an implicit last key
+ * attribute in order to ensure that all keys in the index are physically
+ * unique.
+ */
+ {
+ BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
+ BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
+
+ if (blk1 != blk2)
+ return (blk1 < blk2) ? -1 : 1;
+ }
+ {
+ OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
+ OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
+
+ if (pos1 != pos2)
+ return (pos1 < pos2) ? -1 : 1;
+ }
+
+ /* ItemPointer values should never be equal */
+ Assert(false);
+
+ return 0;
+}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4cb24fa005..f3f0961052 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1122,15 +1122,27 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+typedef enum NBTS_CTX {
+ NBTS_CTX_CACHED,
+ NBTS_CTX_DEFAULT, /* fallback */
+} NBTS_CTX;
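+/*
+ * NBTS_CTX_CACHED is the attcacheoff-optimized shape (see the CACHED
+ * specialization below) and is currently chosen whenever a valid index
+ * relation is available; NBTS_CTX_DEFAULT is the fallback that
+ * _nbt_spec_context() returns when no relation is passed in.
+ */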
+
+static inline NBTS_CTX _nbt_spec_context(Relation irel)
+{
+ if (!PointerIsValid(irel))
+ return NBTS_CTX_DEFAULT;
+
+ return NBTS_CTX_CACHED;
+}
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
+#include "nbtree_spec.h"
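+/*
+ * The include above expands the prototypes in nbtree_specfuncs.h once per
+ * supported key shape, so e.g. _bt_compare gains _bt_compare_cached and
+ * _bt_compare_default variants; afterwards a plain _bt_compare reference
+ * dispatches through NBTS_SPECIALIZE_NAME(), based on the prepared context.
+ */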
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1161,9 +1173,6 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
- IndexTuple newitem, Size newitemsz,
- bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1179,9 +1188,6 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
@@ -1229,16 +1235,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
- int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access,
- Snapshot snapshot, AttrNumber *comparecol,
- char *tupdatabuf);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
- OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -1247,7 +1243,6 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1255,8 +1250,6 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1269,10 +1262,6 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
new file mode 100644
index 0000000000..0bfb623f37
--- /dev/null
+++ b/src/include/access/nbtree_spec.h
@@ -0,0 +1,180 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.h
+ *	  header file for the key-shape specializations of the postgres btree
+ *	  access method.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_spec.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ * - nbts_prep_ctx(irel)
+ * Constructs the context that is used to call specialized functions.
+ * Note that this is optional in code paths that cannot be reached from
+ * unspecialized code, but it should be included in NBTS_BUILD_GENERIC.
+ */
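+
+/*
+ * As a sketch (not code taken from any caller; itup, tupdesc, keysz and
+ * scankey are hypothetical locals), a specialized key comparison loop
+ * built on these macros looks roughly like:
+ *
+ *     nbts_attiterdeclare(itup);
+ *     nbts_attiterinit(itup, 1, tupdesc);
+ *     nbts_foreachattr(1, keysz)
+ *     {
+ *         Datum datum = nbts_attiter_nextattdatum(itup, tupdesc);
+ *
+ *         if (nbts_attiter_curattisnull(itup))
+ *             ... handle a NULL key attribute ...
+ *         else
+ *             ... compare datum against scankey[nbts_attiter_attnum - 1] ...
+ *     }
+ */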
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+#define NBTS_CTX_NAME __nbts_ctx
+
+/* contextual specializations */
+#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
+#define NBTS_SPECIALIZE_NAME(name) ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
+)
+
+/* how do we make names? */
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
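+
+/*
+ * For example, NBTS_MAKE_NAME(_bt_compare, NBTS_TYPE_CACHED) expands to
+ * _bt_compare_cached, and NBTS_SPECIALIZE_NAME(_bt_compare) picks between
+ * _bt_compare_cached and _bt_compare_default at runtime, based on the
+ * NBTS_CTX_NAME variable declared by NBTS_MAKE_CTX().
+ */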
+
+#define nbt_opt_specialize(rel) \
+do { \
+ Assert(PointerIsValid(rel)); \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
+} while (false)
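+/*
+ * That is: if the relation still points at the unspecialized
+ * btinsert_default entry point, have _bt_specialize() switch it over to the
+ * variant matching this index's key shape (presumably by replacing the
+ * aminsert pointer checked above), so the test stops firing on subsequent
+ * calls.
+ */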
+
+/*
+ * Access a specialized nbtree function, based on the shape of the index key.
+ */
+#define NBTS_DEFINITIONS
+
+/*
+ * Call a potentially specialized function for a given btree operation.
+ *
+ * NB: the rel argument is evaluated multiple times.
+ */
+#ifdef NBTS_FUNCTION
+#undef NBTS_FUNCTION
+#endif
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+/* While specializing, the context is the local context */
+#ifdef nbts_prep_ctx
+#undef nbts_prep_ctx
+#endif
+#define nbts_prep_ctx(rel)
+
+/*
+ * Specialization 1: CACHED
+ *
+ * Multiple key columns, optimized access for attcacheoff -cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
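+/*
+ * For this (cached) shape the iterator macros are thin wrappers around
+ * index_getattr(); the win comes from attcacheoff-cacheable offsets keeping
+ * each of those calls cheap.
+ */
+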
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_SPECIALIZING_CACHED
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * Specialization 2: DEFAULT
+ *
+ * "Default", externally accessible, not so much optimized functions
+ */
+
+/* The default context (and all later code) does need to set up a specialization context */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * From here on, every NBTS_FUNCTION reference is a call site of a
+ * specialized function.  Change the macro's result from a direct call into
+ * a conditional call to the right specialization, selected by the current
+ * context.
+ */
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) NBTS_SPECIALIZE_NAME(name)
+
+#undef NBT_SPECIALIZE_FILE
diff --git a/src/include/access/nbtree_specfuncs.h b/src/include/access/nbtree_specfuncs.h
new file mode 100644
index 0000000000..ac60319eff
--- /dev/null
+++ b/src/include/access/nbtree_specfuncs.h
@@ -0,0 +1,66 @@
+/*
+ * prototypes for functions that are included in nbtree.h
+ */
+
+#define _bt_specialize NBTS_FUNCTION(_bt_specialize)
+#define btinsert NBTS_FUNCTION(btinsert)
+#define _bt_dedup_pass NBTS_FUNCTION(_bt_dedup_pass)
+#define _bt_doinsert NBTS_FUNCTION(_bt_doinsert)
+#define _bt_search NBTS_FUNCTION(_bt_search)
+#define _bt_moveright NBTS_FUNCTION(_bt_moveright)
+#define _bt_binsrch_insert NBTS_FUNCTION(_bt_binsrch_insert)
+#define _bt_compare NBTS_FUNCTION(_bt_compare)
+#define _bt_mkscankey NBTS_FUNCTION(_bt_mkscankey)
+#define _bt_checkkeys NBTS_FUNCTION(_bt_checkkeys)
+#define _bt_truncate NBTS_FUNCTION(_bt_truncate)
+#define _bt_keep_natts_fast NBTS_FUNCTION(_bt_keep_natts_fast)
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void _bt_specialize(Relation rel);
+
+extern bool btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool _bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+ int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot, AttrNumber *comparecol,
+ char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
+extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
--
2.39.0
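As an aside on the specialization template above, the following sketch shows roughly what a specializable attribute loop turns into under the DEFAULT macros. It is illustrative only and not part of any patch; itup_isNull (assuming NBTS_MAKE_NAME is plain token pasting), natts, itupdesc, and process_attribute() are hypothetical stand-ins for context that the real callers provide.

/*
 * A specializable loop is written once with the iteration macros:
 *
 *     nbts_attiterdeclare(itup);
 *     nbts_attiterinit(itup, 1, itupdesc);
 *     nbts_foreachattr(1, natts)
 *     {
 *         Datum datum = nbts_attiter_nextattdatum(itup, itupdesc);
 *
 *         if (!nbts_attiter_curattisnull(itup))
 *             process_attribute(datum);
 *     }
 *
 * Under the DEFAULT macros above this expands to roughly the following,
 * i.e. one index_getattr() call per visited attribute:
 */
bool    itup_isNull;

for (int spec_i = 1; spec_i <= natts; spec_i++)
{
    Datum   datum = index_getattr(itup, spec_i, itupdesc, &itup_isNull);

    if (!itup_isNull)
        process_attribute(datum);
}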
v8-0006-btree-specialization-for-variable-length-multi-at.patchapplication/octet-stream; name=v8-0006-btree-specialization-for-variable-length-multi-at.patchDownload
From 6b5092fd1842b66c9a894aef8e12043cf30bebb0 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 13 Jan 2023 15:42:41 +0100
Subject: [PATCH v8 6/6] btree specialization for variable-length
multi-attribute keys
The default code path is relatively slow at O(n^2), so with multiple
attributes we accept the increased startup cost in favour of lower
costs for later attributes.
Note that this will only be used for indexes that have at least one
variable-length key attribute (except when it is the last key attribute,
in specific cases).
---
src/backend/access/nbtree/README | 10 +-
src/backend/access/nbtree/nbtree_spec.c | 3 +
src/include/access/itup_attiter.h | 198 ++++++++++++++++++++++++
src/include/access/nbtree.h | 11 +-
src/include/access/nbtree_spec.h | 48 +++++-
5 files changed, 259 insertions(+), 11 deletions(-)
create mode 100644 src/include/access/itup_attiter.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6864902637..2219c58242 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1104,15 +1104,13 @@ in the index AM to call the specialized functions, increasing the
performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
- - indexes with only a single key attribute
- - multi-column indexes that could benefit from the attcacheoff optimization
+ - indexes with only a single key attribute,
+ - multi-column indexes that cannot pre-calculate the offsets of all key
+ attributes in the tuple data section,
+ - multi-column indexes that do benefit from the attcacheoff optimization
NB: This is also the default path, and is comparatively slow for uncachable
attribute offsets.
-Future work will optimize for multi-column indexes that don't benefit
-from the attcacheoff optimization by improving on the O(n^2) nature of
-index_getattr through storing attribute offsets.
-
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 21635397ed..699197dfa7 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_UNCACHED:
+ _bt_specialize_uncached(rel);
+ break;
case NBTS_CTX_SINGLE_KEYATT:
_bt_specialize_single_keyatt(rel);
break;
diff --git a/src/include/access/itup_attiter.h b/src/include/access/itup_attiter.h
new file mode 100644
index 0000000000..7dd10b4cf0
--- /dev/null
+++ b/src/include/access/itup_attiter.h
@@ -0,0 +1,198 @@
+/*-------------------------------------------------------------------------
+ *
+ * itup_attiter.h
+ * POSTGRES index tuple attribute iterator definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/itup_attiter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ITUP_ATTITER_H
+#define ITUP_ATTITER_H
+
+#include "access/itup.h"
+
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupdesc - the tuple description
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* Offset of attribute 1 is always 0 */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ false, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false);
+
+/*
+ * Initialize an index attribute iterator's internal state up to
+ * attribute attnum.
+ *
+ * This is nearly the same as index_deform_tuple, except that this
+ * stores the internal state up to attnum in the iterator, instead of
+ * populating the datum- and isnull-arrays.
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+/* ----------------
+ * index_attiternext() - get the next attribute of an index tuple
+ *
+ * This gets called many times, so we do the least amount of work
+ * possible.
+ *
+ * The code does not attempt to update attcacheoff, as it is unlikely
+ * to reach a situation where the cached offset matters a lot.
+ * If the cached offsets do matter, the caller should make sure that
+ * PopulateTupleDescCacheOffsets() was called on the tuple descriptor
+ * to populate the attribute offset cache.
+ *
+ * ----------------
+ */
+static inline Datum
+index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true;
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
+#endif /* ITUP_ATTITER_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4628c41e9a..d5ed38bb71 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -16,6 +16,7 @@
#include "access/amapi.h"
#include "access/itup.h"
+#include "access/itup_attiter.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/xlogreader.h"
@@ -1124,18 +1125,26 @@ typedef struct BTOptions
typedef enum NBTS_CTX {
NBTS_CTX_SINGLE_KEYATT,
+ NBTS_CTX_UNCACHED,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
static inline NBTS_CTX _nbt_spec_context(Relation irel)
{
+ AttrNumber nKeyAtts;
+
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
- if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ nKeyAtts = IndexRelationGetNumberOfKeyAttributes(irel);
+
+ if (nKeyAtts == 1)
return NBTS_CTX_SINGLE_KEYATT;
+ if (TupleDescAttr(irel->rd_att, nKeyAtts - 1)->attcacheoff < -1)
+ return NBTS_CTX_UNCACHED;
+
return NBTS_CTX_CACHED;
}
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index 3ad64aad39..a57d69f588 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -44,6 +44,7 @@
* Macros used in the nbtree specialization code.
*/
#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -52,8 +53,10 @@
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
(NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_UNCACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_UNCACHED)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
) \
)
@@ -68,9 +71,12 @@ do { \
Assert(PointerIsValid(rel)); \
if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
{ \
- nbts_prep_ctx(rel); \
Assert(PointerIsValid(rel)); \
- _bt_specialize(rel); \
+ PopulateTupleDescCacheOffsets(rel->rd_att); \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
} \
} while (false)
@@ -216,6 +222,40 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but the attcacheoff optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit((itup), (initAttNum), (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* All subsequent contexts are from non-templated code, so
* they need to actually include the context.
--
2.39.0
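To show how the iterator added in itup_attiter.h above is meant to be used, here is a minimal sketch. It is illustrative only; walk_index_tuple() and process_attribute() are hypothetical. The point is that the running offset lives in the iterator state, so all attributes are decoded in a single left-to-right pass instead of index_getattr() re-deriving the offset from the start for every attribute whose attcacheoff is unusable.

static void
walk_index_tuple(Relation irel, IndexTuple itup)
{
    TupleDesc   itupdesc = RelationGetDescr(irel);
    int         natts = IndexRelationGetNumberOfAttributes(irel);
    IAttrIterStateData iter;

    /* position the iterator so the first "next" call returns attribute 1 */
    index_attiterinit(itup, 1, itupdesc, &iter);

    for (AttrNumber attnum = 1; attnum <= natts; attnum++)
    {
        Datum   datum = index_attiternext(itup, attnum, itupdesc, &iter);

        if (!iter.isNull)
            process_attribute(datum);   /* hypothetical per-attribute work */
    }
}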
On Fri, 20 Jan 2023 at 20:37, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Thu, 12 Jan 2023 at 16:11, David Christensen <david@pgguru.net> wrote:
Hi Matthias,
I'm going to look at this patch series if you're still interested. What was the status of your final performance testing for the 0008 patch alone vs the specialization series? Last I saw on the thread you were going to see if the specialization was required or not.
Thank you for your interest, and sorry for the delayed response. I've
been working on rebasing and polishing the patches, and hit some
issues benchmarking the set. Attached in Perf_results.xlsx are the
results of my benchmarks, and a new rebased patchset.
Attached is v9, which rebases the patchset on b90f0b57 to deal with
compilation errors after d952373a. It also cleans up 0001, which
previously added an unrelated file, but is otherwise unchanged.
Kind regards,
Matthias van de Meent
Attachments:
v9-0005-Add-an-attcacheoff-populating-function.patchapplication/octet-stream; name=v9-0005-Add-an-attcacheoff-populating-function.patchDownload
From 82abd6a316c079ac7b223af5a7a3dcb124447ea2 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 12 Jan 2023 21:34:36 +0100
Subject: [PATCH v9 5/6] Add an attcacheoff-populating function
It populates attcacheoff-capable attributes with the correct offset,
and fills attributes whose offset is uncacheable with an 'uncacheable'
indicator value; as opposed to -1 which signals "unknown".
This allows users of the API to remove redundant cycles that try to
cache the offset of attributes - instead of O(N-attrs) operations, this
one only requires an O(1) check.
---
src/backend/access/common/tupdesc.c | 111 ++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 113 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 72a2c3d3db..f2d80ed0db 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -919,3 +919,114 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc, returning the attribute
+ * number of the last attribute with a valid cached offset value.
+ *
+ * Populates attcacheoff with a negative cache value when no offset
+ * can be calculated (due to e.g. variable length attributes).
+ * The negative value is a value relative to the last cacheable attribute
+ * attcacheoff = -1 - (thisattno - cachedattno)
+ * so that the last attribute with cached offset can be found with
+ * cachedattno = attcacheoff + 1 + thisattno
+ *
+ * The value returned is the AttrNumber of the last (1-based) attribute that
+ * had its offset cached.
+ *
+ * When the TupleDesc has 0 attributes, it returns 0.
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber currAttNo, lastCachedAttNo;
+
+ if (numberOfAttributes == 0)
+ return 0;
+
+ /* Non-negative value: this attribute is cached */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff >= 0)
+ return (AttrNumber) desc->natts;
+ /*
+ * The attribute has been filled with a relative offset to the last cached
+ * value, but its own offset is not cacheable.
+ */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ return (AttrNumber) (TupleDescAttr(desc, desc->natts - 1)->attcacheoff + 1 + desc->natts);
+
+ /* last attribute of the tupledesc may or may not support attcacheoff */
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ currAttNo = 1;
+ /*
+ * Other code may have populated the value previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff >= 0)
+ currAttNo++;
+
+ /*
+ * Cache offset is undetermined. Start calculating offsets if possible.
+ *
+ * When we exit this block, currAttNo will point at the first uncacheable
+ * attribute, or past the end of the attribute array.
+ */
+ if (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, currAttNo - 1);
+ int32 off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, currAttNo);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ {
+ att->attcacheoff = off;
+ currAttNo++;
+ }
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ currAttNo++;
+ }
+ }
+ }
+
+ Assert(currAttNo == numberOfAttributes || (
+ currAttNo < numberOfAttributes
+ && TupleDescAttr(desc, (currAttNo - 1))->attcacheoff >= 0
+ && TupleDescAttr(desc, currAttNo)->attcacheoff == -1
+ ));
+ /*
+ * No cacheable offsets left. Fill the rest with negative cache values,
+ * but return the latest cached offset.
+ */
+ lastCachedAttNo = currAttNo;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ TupleDescAttr(desc, currAttNo)->attcacheoff = -1 - (currAttNo - lastCachedAttNo);
+ currAttNo++;
+ }
+
+ return lastCachedAttNo;
+}
\ No newline at end of file
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index b4286cf922..2673f2d0f3 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.39.0
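As a worked example of the encoding PopulateTupleDescCacheOffsets uses (the numbers are illustrative only): suppose the last attribute with a cacheable offset is cachedattno = 2 and the function is filling thisattno = 5. Then

    attcacheoff = -1 - (thisattno - cachedattno) = -1 - (5 - 2) = -4

and any later reader that finds attcacheoff = -4 at attribute 5 can recover

    cachedattno = attcacheoff + 1 + thisattno = -4 + 1 + 5 = 2

in O(1), without walking the preceding attributes. A value below -1 thereby also acts as the "offset is uncacheable" indicator that _nbt_spec_context() tests on the last key attribute.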
v9-0004-Optimize-nbts_attiter-for-nkeyatts-1-btrees.patchapplication/octet-stream; name=v9-0004-Optimize-nbts_attiter-for-nkeyatts-1-btrees.patchDownload
From f958f99b70daa497d85c3e1303bdf6e55975481d Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 20:04:56 +0100
Subject: [PATCH v9 4/6] Optimize nbts_attiter for nkeyatts==1 btrees
This removes the index_getattr_nocache call path, which has significant overhead, and instead uses a constant offset of 0.
---
src/backend/access/nbtree/README | 1 +
src/backend/access/nbtree/nbtree_spec.c | 3 ++
src/include/access/nbtree.h | 35 +++++++++++++
src/include/access/nbtree_spec.h | 68 +++++++++++++++++++++++--
4 files changed, 102 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 4b11ea9ad7..6864902637 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1104,6 +1104,7 @@ in the index AM to call the specialized functions, increasing the
performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
+ - indexes with only a single key attribute
- multi-column indexes that could benefit from the attcacheoff optimization
NB: This is also the default path, and is comparatively slow for uncachable
attribute offsets.
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 6b766581ab..21635397ed 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_SINGLE_KEYATT:
+ _bt_specialize_single_keyatt(rel);
+ break;
case NBTS_CTX_DEFAULT:
break;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f3f0961052..4628c41e9a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1123,6 +1123,7 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
typedef enum NBTS_CTX {
+ NBTS_CTX_SINGLE_KEYATT,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
@@ -1132,9 +1133,43 @@ static inline NBTS_CTX _nbt_spec_context(Relation irel)
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
+ if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ return NBTS_CTX_SINGLE_KEYATT;
+
return NBTS_CTX_CACHED;
}
+static inline Datum _bt_getfirstatt(IndexTuple tuple, TupleDesc tupleDesc,
+ bool *isNull)
+{
+ Datum result;
+ if (IndexTupleHasNulls(tuple))
+ {
+ if (att_isnull(0, (bits8 *)(tuple) + sizeof(IndexTupleData)))
+ {
+ *isNull = true;
+ result = (Datum) 0;
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)
+ + sizeof(IndexAttributeBitMapData)));
+ }
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)));
+ }
+
+ return result;
+}
+
#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
#include "nbtree_spec.h"
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index 6ddba4d520..3ad64aad39 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -43,6 +43,7 @@
/*
* Macros used in the nbtree specialization code.
*/
+#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -50,8 +51,10 @@
/* contextual specializations */
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
)
@@ -72,9 +75,9 @@ do { \
} while (false)
/*
- * Call a potentially specialized function for a given btree operation.
- *
- * NB: the rel argument is evaluated multiple times.
+ * Protections against multiple inclusions - the definition of this macro is
+ * different for files included with the templating mechanism vs the users
+ * of this template, so redefine these macros at top and bottom.
*/
#ifdef NBTS_FUNCTION
#undef NBTS_FUNCTION
@@ -164,6 +167,61 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Specialization 3: SINGLE_KEYATT
+ *
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path cannot be used for indexes with multiple key
+ * columns, because it never considers the next column.
+ */
+
+/* templated contexts are already specialized, so no specialization context is needed here */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel)
+
+#define NBTS_SPECIALIZING_SINGLE_KEYATT
+#define NBTS_TYPE NBTS_TYPE_SINGLE_KEYATT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert((endAttNum) == 1); ((void) (endAttNum)); \
+ if ((initAttNum) == 1) for (int spec_i = 0; spec_i < 1; spec_i++)
+
+#define nbts_attiter_attnum 1
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+( \
+ AssertMacro(spec_i == 0), \
+ _bt_getfirstatt(itup, tupDesc, &NBTS_MAKE_NAME(itup, isNull)) \
+)
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_KEYATT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * All subsequent contexts are from non-templated code, so
+ * they need to actually include the context.
+ */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
/*
* from here on all NBTS_FUNCTIONs are from specialized function names that
* are being called. Change the result of those macros from a direct call
--
2.39.0
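For comparison with the default expansion sketched earlier, under the SINGLE_KEYATT macros above the same illustrative loop collapses to roughly the following (again a sketch only, with natts, itupdesc, itup_isNull, and process_attribute() as hypothetical stand-ins); the constant bounds let the compiler reduce it to a single unconditional _bt_getfirstatt() call reading the first attribute at a constant offset:

bool    itup_isNull;

Assert(natts == 1);
((void) natts);
if (1 == 1)     /* initAttNum is the constant 1 at this call site */
    for (int spec_i = 0; spec_i < 1; spec_i++)
    {
        Datum   datum = _bt_getfirstatt(itup, itupdesc, &itup_isNull);

        if (!itup_isNull)
            process_attribute(datum);
    }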
v9-0003-Use-specialized-attribute-iterators-in-the-specia.patchapplication/octet-stream; name=v9-0003-Use-specialized-attribute-iterators-in-the-specia.patchDownload
From 9167050d806299d173a11947e5d7af0f508e82d3 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:57:21 +0100
Subject: [PATCH v9 3/6] Use specialized attribute iterators in the specialized
source files
This is committed separately to make clear what substantial changes were
made to the pre-existing code.
Even though not all nbt*_spec functions have been updated, these functions
can now directly call (and inline, and optimize for) the specialized functions
they call, instead of having to determine the right specialization based on
the (potentially locally unavailable) index relation, which still makes those
functions worth specializing/duplicating.
---
src/backend/access/nbtree/nbtsearch_spec.c | 18 +++---
src/backend/access/nbtree/nbtsort_spec.c | 24 +++----
src/backend/access/nbtree/nbtutils_spec.c | 62 ++++++++++++-------
.../utils/sort/tuplesortvariants_spec.c | 54 +++++++++-------
src/include/access/nbtree_spec.h | 10 +--
5 files changed, 95 insertions(+), 73 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
index 37cc3647d3..4ce39e7724 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.c
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -632,6 +632,7 @@ _bt_compare(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -664,23 +665,26 @@ _bt_compare(Relation rel,
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, *comparecol, itupdesc);
+
+ nbts_foreachattr(*comparecol, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -709,7 +713,7 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
{
- *comparecol = i;
+ *comparecol = nbts_attiter_attnum;
return result;
}
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
index 368d6f244c..6f33cc4cc2 100644
--- a/src/backend/access/nbtree/nbtsort_spec.c
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -34,8 +34,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -57,7 +56,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -90,21 +89,24 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else if (itup != NULL)
{
int32 compare = 0;
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
index 0288da22d6..07ca18f404 100644
--- a/src/backend/access/nbtree/nbtutils_spec.c
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -64,7 +64,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
- int i;
+ nbts_attiterdeclare(itup);
itupdesc = RelationGetDescr(rel);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -95,7 +95,10 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -106,27 +109,30 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -675,6 +681,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
@@ -686,20 +694,22 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -707,6 +717,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -747,24 +758,27 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
index 0791f41136..61c4826853 100644
--- a/src/backend/utils/sort/tuplesortvariants_spec.c
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -40,11 +40,8 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
bool equal_hasnull = false;
int nkey;
int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
+ nbts_attiterdeclare(tuple1);
+ nbts_attiterdeclare(tuple2);
/* Compare the leading sort key */
compare = ApplySortComparator(a->datum1, a->isnull1,
@@ -59,37 +56,46 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
keysz = base->nKeys;
tupDes = RelationGetDescr(arg->index.indexRel);
- if (sortKey->abbrev_converter)
+ if (!sortKey->abbrev_converter)
{
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
+ nkey = 2;
+ sortKey++;
+ }
+ else
+ {
+ nkey = 1;
}
/* they are equal, so we only need to examine one null flag */
if (a->isnull1)
equal_hasnull = true;
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ nbts_attiterinit(tuple1, nkey, tupDes);
+ nbts_attiterinit(tuple2, nkey, tupDes);
+
+ nbts_foreachattr(nkey, keysz)
{
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+ Datum datum1,
+ datum2;
+ datum1 = nbts_attiter_nextattdatum(tuple1, tupDes);
+ datum2 = nbts_attiter_nextattdatum(tuple2, tupDes);
+
+ if (nbts_attiter_attnum == 1)
+ compare = ApplySortAbbrevFullComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ else
+ compare = ApplySortComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
if (compare != 0)
- return compare; /* done when we find unequal attributes */
+ return compare;
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
+ if (nbts_attiter_curattisnull(tuple1))
equal_hasnull = true;
+
+ sortKey++;
}
/*
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index 0bfb623f37..6ddba4d520 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -66,15 +66,11 @@ do { \
if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
{ \
nbts_prep_ctx(rel); \
+ Assert(PointerIsValid(rel)); \
_bt_specialize(rel); \
} \
} while (false)
-/*
- * Access a specialized nbtree function, based on the shape of the index key.
- */
-#define NBTS_DEFINITIONS
-
/*
* Call a potentially specialized function for a given btree operation.
*
@@ -102,7 +98,7 @@ do { \
#define nbts_attiterdeclare(itup) \
bool NBTS_MAKE_NAME(itup, isNull)
-#define nbts_attiterinit(itup, initAttNum, tupDesc)
+#define nbts_attiterinit(itup, initAttNum, tupDesc) do {} while (false)
#define nbts_foreachattr(initAttNum, endAttNum) \
for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
@@ -132,7 +128,7 @@ do { \
* "Default", externally accessible, not so much optimized functions
*/
-/* the default context (and later contexts) do need to specialize, so here's that */
+/* the default context needs to specialize, so here's that */
#undef nbts_prep_ctx
#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
--
2.39.0
v9-0001-Implement-dynamic-prefix-compression-in-nbtree.patchapplication/octet-stream; name=v9-0001-Implement-dynamic-prefix-compression-in-nbtree.patchDownload
From 6be07223db0c1855410dc2630935420bdc368b46 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 10 Jan 2023 21:45:44 +0100
Subject: [PATCH v9 1/6] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, if some prefix of the
scan attributes of the tuples on both sides of the compared tuple
is equal to the scankey, then those prefix attributes of the tuple
being compared must also be equal to the scankey.
We cannot generally propagate this information to _binsrch on
lower pages, as this downstream page may have concurrently split
and/or have merged with its deleted left neighbour (see [0]),
which moves the keyspace of the linked page. We thus can only
trust the current state of the page itself for this optimization,
which means we must validate this state each time we open the page.
Although this limits the overall applicability of the optimization,
it still allows for a nice performance improvement in most cases
where initial columns have many duplicate values and a compare
function that is not cheap.
As an exception to the above rule, most of the time a page's
highkey is equal to the right separator on the parent page due to
how btree splits are done. By storing this right separator from
the parent page and then validating that the highkey of the child
page contains the exact same data, we can restore the right prefix
bound without having to call the relatively expensive _bt_compare.
In the worst-case scenario of a concurrent page split, we'd still
have to validate the full key, but that doesn't happen very often
when compared to the number of times we descend the btree.
---
contrib/amcheck/verify_nbtree.c | 17 ++--
src/backend/access/nbtree/README | 42 +++++++++
src/backend/access/nbtree/nbtinsert.c | 34 +++++---
src/backend/access/nbtree/nbtsearch.c | 119 +++++++++++++++++++++++---
src/include/access/nbtree.h | 10 ++-
5 files changed, 188 insertions(+), 34 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 257cff671b..22bb229820 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2701,6 +2701,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2710,13 +2711,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2778,6 +2779,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2788,7 +2790,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2840,10 +2842,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2863,10 +2866,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2901,13 +2905,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index dd0f7ad2bd..4d7fa5aff4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,48 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because NBTrees have a sorted keyspace, when we have determined that some
+prefixing columns of tuples on both sides of the tuple that is being
+compared are equal to the scankey, then the current tuple must also share
+this prefix with the scankey. This allows us to skip comparing those columns,
+saving the indirect function calls in the compare operation.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so this is only useful on the page level: Concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by a parent page to the right. If we re-used high- and low-column-prefixes,
+we would not be able to detect a change of keyspace from e.g. [2,3) to [1,2),
+and subsequently return invalid results. This race condition can only be
+prevented by re-establishing the prefix-equal-columns for each page.
+
+There is positive news, though: A page split will put a binary copy of the
+page's highkey in the parent page. This means that we usually can reuse
+the compare result of the parent page's downlink's right sibling when we
+discover that their representation is binary equal. In general this will
+be the case, as only in concurrent page splits and deletes the downlink
+may not point to the page with the correct highkey bound (_bt_moveright
+only rarely actually moves right).
+
+To implement this, we copy the downlink's right differentiator key into a
+temporary buffer, which is then compared against the child page's highkey.
+If they match, we reuse the compare result (plus prefix) we had for it from
+the parent page; if not, we need to do a full _bt_compare. Because memcpy +
+memcmp is cheap compared to _bt_compare, and because it's quite unlikely
+that we guess wrong, this speeds up our _bt_moveright code (at the cost of
+some stack memory in _bt_search and some overhead in case of a wrong
+prediction).
+
+Now that we have prefix bounds on the highest value of a page, the
+_bt_binsrch procedure will use this result as a rightmost prefix compare,
+and for each step in the binary search (that does not compare less than the
+insert key) improve the equal-prefix bounds.
+
+Using the above optimization, we now (on average) only need 2 full key
+compares per page, as opposed to ceil(log2(ntupsperpage)) + 1; a significant
+improvement.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f4c1a974ef..4c3bdefae2 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -326,6 +326,7 @@ _bt_search_insert(Relation rel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -344,7 +345,8 @@ _bt_search_insert(Relation rel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -438,7 +440,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = _bt_binsrch_insert(rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -448,6 +450,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -483,7 +487,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(_bt_compare(rel, itup_key, page, offset, &cmpcol) < 0);
break;
}
@@ -508,7 +512,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (_bt_compare(rel, itup_key, page, offset, &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -718,11 +722,12 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -865,6 +870,8 @@ _bt_findinsertloc(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Does the new tuple belong on this page?
*
@@ -882,7 +889,7 @@ _bt_findinsertloc(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, insertstate, stack);
@@ -935,6 +942,8 @@ _bt_findinsertloc(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
+
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -965,7 +974,7 @@ _bt_findinsertloc(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -980,10 +989,13 @@ _bt_findinsertloc(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -1002,7 +1014,7 @@ _bt_findinsertloc(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c43c1a2830..e3b828137b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +99,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
/* Get the root page to start with */
*bufP = _bt_getroot(rel, access);
@@ -130,7 +133,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* opportunity to finish splits of internal pages too.
*/
*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot);
+ page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -142,12 +146,15 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = _bt_binsrch(rel, key, *bufP);
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
child = BTreeTupleGetDownLink(itup);
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
/*
* We need to save the location of the pivot tuple we chose in a new
* stack entry for this page/level. If caller ends up splitting a
@@ -181,6 +188,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ highkeycmpcol = 1;
+
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -191,7 +200,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* move right to its new sibling. Do that.
*/
*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
+ snapshot, &highkeycmpcol, (char *) tupdatabuf);
}
return stack_in;
@@ -239,12 +248,16 @@ _bt_moveright(Relation rel,
bool forupdate,
BTStack stack,
int access,
- Snapshot snapshot)
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
+
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -266,12 +279,17 @@ _bt_moveright(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -297,14 +315,55 @@ _bt_moveright(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+ /*
+ * tupdatabuf is filled with the right separator of the parent node.
+ * This allows us to do a binary equality check between the parent
+ * node's right separator (which is < key) and this page's P_HIKEY.
+ * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+ * Without this, we'd on average use 3 full key compares per page before
+ * we achieve full dynamic prefix bounds, but with this optimization
+ * that is only 2.
+ *
+ * 3 compares: 1 for the highkey (rightmost), and on average 2 before
+ * we move right in the binary search on the page; this average equals
+ * SUM(1/2 ^ x) for x from 0 to log(n items), which tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ } else {
+ *comparecol = 1;
+ }
+ } else {
+ *comparecol = 1;
+ }
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
{
+ *comparecol = 1;
/* step right one page */
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -337,7 +396,8 @@ _bt_moveright(Relation rel,
static OffsetNumber
_bt_binsrch(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -345,6 +405,8 @@ _bt_binsrch(Relation rel,
high;
int32 result,
cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -386,16 +448,25 @@ _bt_binsrch(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
+
+ *highkeycmpcol = highcmpcol;
/*
* At this point we have high == low, but be careful: they could point
@@ -439,7 +510,8 @@ _bt_binsrch(Relation rel,
* list split).
*/
OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -449,6 +521,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -499,16 +572,22 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
+
if (result != 0)
stricthigh = high;
}
@@ -656,7 +735,8 @@ int32
_bt_compare(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -696,8 +776,9 @@ _bt_compare(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
{
Datum datum;
bool isNull;
@@ -741,11 +822,20 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = i;
return result;
+ }
scankey++;
}
+ /*
+	 * All tuple attributes are equal to the scan key; only later attributes
+	 * could potentially differ from the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
@@ -876,6 +966,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ AttrNumber cmpcol = 1;
Assert(!BTScanPosIsValid(so->currPos));
@@ -1392,7 +1483,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = _bt_binsrch(rel, &inskey, buf, &cmpcol);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8f48960f9d..4cb24fa005 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1232,9 +1232,13 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
int access, Snapshot snapshot);
extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot, AttrNumber *comparecol,
+ char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
--
2.39.0
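
The dynamic prefix handling added above (lowcmpcol/highcmpcol in _bt_binsrch,
and the comparecol in/out argument of _bt_compare) may be easier to follow
with a standalone sketch. The following is an illustration only, assuming a
toy table of fixed-width integer columns; Row, row_compare, binsrch and NATTS
are invented for the example and are not part of the patch:

/*
 * Toy version of prefix-bounded binary search: find the first row >= key.
 *
 * lowskip/highskip play the role of lowcmpcol/highcmpcol: the number of
 * leading columns known to be equal between the key and the current
 * lower/upper search bound.  Because rows are sorted lexicographically,
 * every row between the bounds must agree with the key on the first
 * Min(lowskip, highskip) columns, so the comparison can start there.
 */
#include <stdio.h>

#define NATTS 3

typedef struct Row
{
    int     col[NATTS];
} Row;

/* Compare key against row, starting at column *skip; record how many
 * leading columns turned out to be equal. */
static int
row_compare(const int *key, const Row *row, int *skip)
{
    for (int i = *skip; i < NATTS; i++)
    {
        if (key[i] != row->col[i])
        {
            *skip = i;          /* columns 0 .. i-1 are equal */
            return (key[i] < row->col[i]) ? -1 : 1;
        }
    }
    *skip = NATTS;              /* fully equal */
    return 0;
}

static int
binsrch(const Row *rows, int nrows, const int *key)
{
    int     low = 0,
            high = nrows;
    int     lowskip = 0,
            highskip = 0;

    while (low < high)
    {
        int     mid = low + (high - low) / 2;
        int     skip = (lowskip < highskip) ? lowskip : highskip;
        int     cmp = row_compare(key, &rows[mid], &skip);

        if (cmp > 0)
        {
            low = mid + 1;
            lowskip = skip;     /* new lower bound's shared prefix */
        }
        else
        {
            high = mid;
            highskip = skip;    /* new upper bound's shared prefix */
        }
    }
    return low;
}

int
main(void)
{
    Row     rows[] = {{{1, 1, 1}}, {{1, 1, 2}}, {{1, 2, 0}}, {{2, 0, 0}}};
    int     key[NATTS] = {1, 2, 0};

    printf("first row >= key: %d\n", binsrch(rows, 4, key));    /* prints 2 */
    return 0;
}

The only claim illustrated here is the invariant: both bounds agree with the
key on their first Min(lowskip, highskip) columns, so any row sandwiched
between them must agree on that prefix as well, and the per-row comparison
can skip it.
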
v9-0002-Specialize-nbtree-functions-on-btree-key-shape.patchapplication/octet-stream; name=v9-0002-Specialize-nbtree-functions-on-btree-key-shape.patchDownload
From 53b4b1c4598c4269379299c7999c2acb8240fea1 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:13:04 +0100
Subject: [PATCH v9 2/6] Specialize nbtree functions on btree key shape.
nbtree keys are not all made the same, so a significant amount of time is
spent on code that exists only to deal with other key shapes. By specializing
function calls based on the key shape, we can remove or reduce these causes
of overhead.
This commit adds the basic infrastructure for specializing specific hot code
in the nbtree AM to certain shapes of keys, and splits the code that can
benefit from attribute offset optimizations into separate files. This does
NOT yet update the code itself - it just makes the code compile cleanly.
The performance should be comparable if not the same.
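
For readers who haven't seen this technique before: the patch compiles the
hot functions once per key shape by including a template file several times
under different macro definitions (NBT_SPECIALIZE_FILE / nbtree_spec.h), and
mangles the function names with NBTS_FUNCTION so the copies can coexist. A
rough, self-contained sketch of that pattern follows; the NBTS_FUNCTION name
echoes the patch, but everything else below (the file names, DemoTuple,
NBTS_SHAPE, NBTS_GET_ATT, the dispatcher) is invented for illustration and
does not reflect the actual contents of nbtree_spec.h:

/* spec_demo_template.c -- hypothetical template, included once per shape,
 * with NBTS_SHAPE and NBTS_GET_ATT defined by the includer. */
static int
NBTS_FUNCTION(sum_atts)(const DemoTuple *tup)
{
    int     total = 0;

    for (int i = 0; i < tup->natts; i++)
        total += NBTS_GET_ATT(tup, i);
    return total;
}

/* spec_demo.c -- hypothetical "dispatch" translation unit */
#include <stdio.h>

typedef struct DemoTuple
{
    int     natts;
    int     atts[8];
} DemoTuple;

/* Build a distinct symbol per key shape. */
#define NBTS_MAKE_NAME_(name, shape)    name##_##shape
#define NBTS_MAKE_NAME(name, shape)     NBTS_MAKE_NAME_(name, shape)
#define NBTS_FUNCTION(name)             NBTS_MAKE_NAME(name, NBTS_SHAPE)

/* Shape 1: single key attribute -- attribute access needs no offset walk. */
#define NBTS_SHAPE          single_att
#define NBTS_GET_ATT(t, i)  ((t)->atts[0])
#include "spec_demo_template.c"
#undef NBTS_SHAPE
#undef NBTS_GET_ATT

/* Shape 2: generic multi-attribute access. */
#define NBTS_SHAPE          multi_att
#define NBTS_GET_ATT(t, i)  ((t)->atts[(i)])
#include "spec_demo_template.c"
#undef NBTS_SHAPE
#undef NBTS_GET_ATT

/* Pick a specialization once per call site; the patch's nbts_prep_ctx()
 * and call macros serve a similar purpose, though the details differ. */
static int
sum_atts(const DemoTuple *tup)
{
    if (tup->natts == 1)
        return sum_atts_single_att(tup);
    return sum_atts_multi_att(tup);
}

int
main(void)
{
    DemoTuple   t = {3, {1, 2, 3}};

    printf("%d\n", sum_atts(&t));   /* prints 6 */
    return 0;
}

The template body is written once, but the compiler emits one copy per shape,
which is what lets each copy use whatever attribute-access strategy is
cheapest for that shape.
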
---
contrib/amcheck/verify_nbtree.c | 7 +
src/backend/access/nbtree/README | 33 +-
src/backend/access/nbtree/nbtdedup.c | 300 +----
src/backend/access/nbtree/nbtdedup_spec.c | 317 +++++
src/backend/access/nbtree/nbtinsert.c | 566 +--------
src/backend/access/nbtree/nbtinsert_spec.c | 583 +++++++++
src/backend/access/nbtree/nbtpage.c | 1 +
src/backend/access/nbtree/nbtree.c | 37 +-
src/backend/access/nbtree/nbtree_spec.c | 69 ++
src/backend/access/nbtree/nbtsearch.c | 1075 +---------------
src/backend/access/nbtree/nbtsearch_spec.c | 1087 +++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 264 +---
src/backend/access/nbtree/nbtsort_spec.c | 280 +++++
src/backend/access/nbtree/nbtsplitloc.c | 3 +
src/backend/access/nbtree/nbtutils.c | 754 +-----------
src/backend/access/nbtree/nbtutils_spec.c | 775 ++++++++++++
src/backend/utils/sort/tuplesortvariants.c | 144 +--
.../utils/sort/tuplesortvariants_spec.c | 158 +++
src/include/access/nbtree.h | 45 +-
src/include/access/nbtree_spec.h | 180 +++
src/include/access/nbtree_specfuncs.h | 66 +
21 files changed, 3609 insertions(+), 3135 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.c
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.c
create mode 100644 src/backend/access/nbtree/nbtree_spec.c
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.c
create mode 100644 src/backend/access/nbtree/nbtsort_spec.c
create mode 100644 src/backend/access/nbtree/nbtutils_spec.c
create mode 100644 src/backend/utils/sort/tuplesortvariants_spec.c
create mode 100644 src/include/access/nbtree_spec.h
create mode 100644 src/include/access/nbtree_specfuncs.h
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 22bb229820..fb89d6ada2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2680,6 +2680,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTStack stack;
Buffer lbuf;
bool exists;
+ nbts_prep_ctx(NULL);
key = _bt_mkscankey(state->rel, itup);
Assert(key->heapkeyspace && key->scantid != NULL);
@@ -2780,6 +2781,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2843,6 +2845,7 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2867,6 +2870,7 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2906,6 +2910,7 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2966,6 +2971,7 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
Page page;
BTPageOpaque opaque;
OffsetNumber maxoffset;
+ nbts_prep_ctx(NULL);
page = palloc(BLCKSZ);
@@ -3141,6 +3147,7 @@ static inline BTScanInsert
bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
+ nbts_prep_ctx(NULL);
skey = _bt_mkscankey(rel, itup);
skey->pivotsearch = true;
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 4d7fa5aff4..4b11ea9ad7 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -940,8 +940,9 @@ and for each step in the binary search (that does not compare less than the
insert key) improve the equal-prefix bounds.
Using the above optimization, we now (on average) only need 2 full key
-compares per page, as opposed to ceil(log2(ntupsperpage)) + 1; a significant
-improvement.
+compares per page (plus ceil(log2(ntupsperpage)) single-attribute compares),
+as opposed to the ceil(log2(ntupsperpage)) + 1 of a naive implementation;
+a significant improvement.
Notes about deduplication
-------------------------
@@ -1083,6 +1084,34 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+Notes about nbtree specialization
+---------------------------------
+
+Attribute iteration is a significant source of overhead for multi-column
+indexes with variable-length attributes, because we cannot cache the offset
+of each attribute into the on-disk tuple. To combat this, we'd have to either
+fully deserialize the tuple, or maintain our offset into the tuple as we
+iterate over the tuple's fields.
+
+Keeping track of this offset has a non-negligible overhead of its own, so we'd
+prefer to not have to keep track of these offsets when we can use the cache.
+By specializing performance-sensitive search functions for these specific
+index tuple shapes and calling those selectively, we can keep the performance
+of cacheable attribute offsets where that is applicable, while improving
+performance where we currently would see O(n_atts^2) time iterating on
+variable-length attributes. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - multi-column indexes that could benefit from the attcacheoff optimization
+   NB: This is also the default path, and is comparatively slow for uncacheable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
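
To make the attcacheoff argument in the new README section concrete, here is
a toy sketch, assuming a made-up layout of length-prefixed byte fields (not
the real index tuple format); ToyTuple, toy_getattr and ToyIter are invented
for the example. Re-deriving the offset for every attribute makes a full pass
over the tuple quadratic in the number of attributes once any earlier
attribute is variable-length, while an iterator that carries its offset
forward stays linear:

#include <stdio.h>

/* toy tuple: natts length-prefixed fields packed back to back */
typedef struct ToyTuple
{
    int                  natts;
    const unsigned char *data;
} ToyTuple;

/*
 * O(attno) per call: walk from the start of the tuple every time, like
 * index_getattr when attcacheoff cannot help.  A loop over all attributes
 * therefore costs O(natts^2).
 */
static const unsigned char *
toy_getattr(const ToyTuple *tup, int attno, int *len)
{
    const unsigned char *p = tup->data;

    for (int i = 0; i < attno; i++)
        p += 1 + p[0];          /* skip length byte plus payload */
    *len = p[0];
    return p + 1;
}

/*
 * O(1) per step: the iterator remembers where the previous attribute ended,
 * so a full pass over the tuple is O(natts).
 */
typedef struct ToyIter
{
    const unsigned char *pos;
} ToyIter;

static const unsigned char *
toy_iter_next(ToyIter *it, int *len)
{
    const unsigned char *p = it->pos;

    *len = p[0];
    it->pos = p + 1 + p[0];
    return p + 1;
}

int
main(void)
{
    /* three fields: "a", "bc", "def" */
    static const unsigned char buf[] = {1, 'a', 2, 'b', 'c', 3, 'd', 'e', 'f'};
    ToyTuple    tup = {3, buf};
    ToyIter     it = {buf};
    int         len;

    for (int i = 0; i < tup.natts; i++)
    {
        const unsigned char *v = toy_getattr(&tup, i, &len);

        printf("getattr %d: %.*s\n", i, len, (const char *) v);
    }
    for (int i = 0; i < tup.natts; i++)
    {
        const unsigned char *v = toy_iter_next(&it, &len);

        printf("iter    %d: %.*s\n", i, len, (const char *) v);
    }
    return 0;
}

Both loops print the same three fields; the difference is only in how much of
the tuple gets re-walked to find each one, which is the trade-off these
patches aim at.
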
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 0349988cf5..4589ade267 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,260 +22,14 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple and add it to our temp page (newpage).
- * Else add pending interval's base tuple to the temp page as-is.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place, when this page's newly allocated right sibling page
- * gets its first deduplication pass).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.c"
+#include "access/nbtree_spec.h"
/*
* Perform bottom-up index deletion pass.
@@ -316,6 +70,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
TM_IndexDeleteOp delstate;
bool neverdedup;
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ nbts_prep_ctx(rel);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -752,55 +507,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
diff --git a/src/backend/access/nbtree/nbtdedup_spec.c b/src/backend/access/nbtree/nbtdedup_spec.c
new file mode 100644
index 0000000000..584211fe66
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.c
@@ -0,0 +1,317 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup_spec.c
+ * Index shape-specialized functions for nbtdedup.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtdedup_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_do_singleval NBTS_FUNCTION(_bt_do_singleval)
+
+static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
+ Size newitemsz, bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple and add it to our temp page (newpage).
+ * Else add pending interval's base tuple to the temp page as-is.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place, when this page's newly allocated right sibling page
+ * gets its first deduplication pass).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+_bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4c3bdefae2..ca8ea60ffb 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,17 +30,10 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
@@ -73,313 +66,8 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
- AttrNumber cmpcol = 1;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
- &cmpcol) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
-
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
- NULL);
-}
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.c"
+#include "access/nbtree_spec.h"
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -423,6 +111,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
bool inposting = false;
bool prevalldead = true;
int curposti = 0;
+ nbts_prep_ctx(rel);
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -774,253 +463,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- {
- AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
- }
-
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1501,6 +943,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
bool newitemonleft,
isleaf,
isrightmost;
+ nbts_prep_ctx(rel);
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -2693,6 +2136,7 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(buffer);
BTPageOpaque opaque = BTPageGetOpaque(page);
+ nbts_prep_ctx(rel);
Assert(P_ISLEAF(opaque));
Assert(simpleonly || itup_key->heapkeyspace);
diff --git a/src/backend/access/nbtree/nbtinsert_spec.c b/src/backend/access/nbtree/nbtinsert_spec.c
new file mode 100644
index 0000000000..d37afae5ae
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.c
@@ -0,0 +1,583 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtinsert_spec.c
+ * Index shape-specialized functions for nbtinsert.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtinsert_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_search_insert NBTS_FUNCTION(_bt_search_insert)
+#define _bt_findinsertloc NBTS_FUNCTION(_bt_findinsertloc)
+
+static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
+static OffsetNumber _bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+_bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = _bt_mkscankey(rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = _bt_search_insert(rel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+ itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+_bt_search_insert(Relation rel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
+ NULL);
+}
+
+
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+_bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
+
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 3feee28d19..7710226f41 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1819,6 +1819,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
bool rightsib_empty;
Page page;
BTPageOpaque opaque;
+ nbts_prep_ctx(rel);
/*
* Save original leafbuf block number from caller. Only deleted blocks
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1cc88da032..ceeafd637f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,8 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
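+/*
+ * Generate the key shape-specialized variants of nbtree.c's specializable
+ * code: NBT_SPECIALIZE_FILE names the template file that nbtree_spec.h is
+ * expected to include once per supported key shape.
+ */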
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.c"
+#include "access/nbtree_spec.h"
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -120,7 +122,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambuild = btbuild;
amroutine->ambuildempty = btbuildempty;
- amroutine->aminsert = btinsert;
+ amroutine->aminsert = btinsert_default;
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
@@ -152,6 +154,8 @@ btbuildempty(Relation index)
{
Page metapage;
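+ /*
+  * nbt_opt_specialize() is assumed to replace the generic
+  * rd_indam->aminsert with the variant specialized for this index's key
+  * shape on first use (see _bt_specialize() in nbtree_spec.c).
+  */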
+ nbt_opt_specialize(index);
+
/* Construct metapage. */
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
@@ -177,33 +181,6 @@ btbuildempty(Relation index)
smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
@@ -345,6 +322,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -788,6 +767,8 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(rel);
+
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
new file mode 100644
index 0000000000..6b766581ab
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.c
+ * Index shape-specialized functions for nbtree.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtree_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
+void
+_bt_specialize(Relation rel)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ NBTS_MAKE_CTX(rel);
+ /*
+ * We can't reference _bt_specialize by name here because it would be
+ * macro-expanded, nor can we use NBTS_SPECIALIZE_NAME, since that would
+ * call back into _bt_specialize and recurse endlessly.
+ */
+ switch (__nbts_ctx)
+ {
+ case NBTS_CTX_CACHED:
+ _bt_specialize_cached(rel);
+ break;
+ case NBTS_CTX_DEFAULT:
+ break;
+ }
+#else
+ rel->rd_indam->aminsert = btinsert;
+#endif
+}
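+/*
+ * A minimal sketch of how the machinery from access/nbtree_spec.h is
+ * assumed to fit together here:
+ *
+ *   NBTS_MAKE_CTX(rel)  -- classifies rel's key shape into the local
+ *                          __nbts_ctx (one of the NBTS_CTX_* values above)
+ *   NBTS_FUNCTION(name) -- appends the per-shape suffix, so that in the
+ *                          default specialization btinsert below becomes
+ *                          btinsert_default, matching the aminsert
+ *                          assignment in bthandler()
+ */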
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+ bool result;
+ IndexTuple itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e3b828137b..0089fe7eeb 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,12 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
- AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -47,6 +43,8 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.c"
+#include "access/nbtree_spec.h"
/*
* _bt_drop_lock_and_maybe_pin()
@@ -71,572 +69,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- */
-BTStack
-_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
- Snapshot snapshot)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
- char tupdatabuf[BLCKSZ / 3];
- AttrNumber highkeycmpcol = 1;
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot, &highkeycmpcol,
- (char *) tupdatabuf);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
- memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- highkeycmpcol = 1;
-
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot, &highkeycmpcol, (char *) tupdatabuf);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- */
-Buffer
-_bt_moveright(Relation rel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- Snapshot snapshot,
- AttrNumber *comparecol,
- char *tupdatabuf)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- page = BufferGetPage(buf);
- TestForOldSnapshot(snapshot, rel, page);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- {
- *comparecol = 1;
- break;
- }
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- /*
- * tupdatabuf is filled with the right separator of the parent node.
- * This allows us to do a binary equality check between the parent
- * node's right separator (which is < key) and this page's P_HIKEY.
- * If they equal, we can reuse the result of the parent node's
- * rightkey compare, which means we can potentially save a full key
- * compare (which includes indirect calls to attribute comparison
- * functions).
- *
- * Without this, we'd on average use 3 full key compares per page before
- * we achieve full dynamic prefix bounds, but with this optimization
- * that is only 2.
- *
- * 3 compares: 1 for the high key (rightmost), and on average 2 before
- * we move right in the binary search on the page; this average equals
- * SUM (1/2 ^ x) for x from 0 to log(n items), which tends to 2.
- */
- if (!P_IGNORE(opaque) && *comparecol > 1)
- {
- IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
- IndexTuple buftuple = (IndexTuple) tupdatabuf;
- if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
- {
- char *dataptr = (char *) itup;
-
- if (memcmp(dataptr + sizeof(IndexTupleData),
- tupdatabuf + sizeof(IndexTupleData),
- IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
- break;
- } else {
- *comparecol = 1;
- }
- } else {
- *comparecol = 1;
- }
-
- if (P_IGNORE(opaque) ||
- _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
- {
- *comparecol = 1;
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- {
- *comparecol = cmpcol;
- break;
- }
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf,
- AttrNumber *highkeycmpcol)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
- AttrNumber highcmpcol = *highkeycmpcol,
- lowcmpcol = 1;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
- }
- }
-
- *highkeycmpcol = highcmpcol;
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
- AttrNumber lowcmpcol = 1;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
-
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
-
/*----------
* _bt_binsrch_posting() -- posting list binary search.
*
@@ -704,228 +136,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum,
- AttrNumber *comparecol)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
-
- scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- {
- *comparecol = i;
- return result;
- }
-
- scankey++;
- }
-
- /*
- * All tuple attributes are equal to the scan key, only later attributes
- * could potentially not equal the scan key.
- */
- *comparecol = ntupatts + 1;
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
-
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -967,6 +177,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
BTScanPosItem *currItem;
BlockNumber blkno;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(rel);
Assert(!BTScanPosIsValid(so->currPos));
@@ -1589,280 +800,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2071,12 +1008,11 @@ static bool
_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Relation rel;
+ Relation rel = scan->indexRelation;
Page page;
BTPageOpaque opaque;
bool status;
-
- rel = scan->indexRelation;
+ nbts_prep_ctx(rel);
if (ScanDirectionIsForward(dir))
{
@@ -2488,6 +1424,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
BTPageOpaque opaque;
OffsetNumber start;
BTScanPosItem *currItem;
+ nbts_prep_ctx(rel);
/*
* Scan down to the leftmost or rightmost leaf page. This is a simplified
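
Note on the machinery used throughout these hunks: the nbts_prep_ctx(rel) calls added above and the NBTS_FUNCTION()/NBT_SPECIALIZE_FILE macros used in the new *_spec.c files below are defined in access/nbtree_spec.h, which is not shown in this part of the diff. As a rough, hypothetical illustration of the general technique only -- the names, macro shapes, and dispatch rule here are invented for the example and are not the patch's actual definitions -- one function body can be compiled into several differently-named specializations via token pasting, with the variant picked once per index based on its key shape:

/* spec_sketch.c -- toy model of compiling one body into per-shape variants */
#include <stdio.h>

#define NBTS_MAKE_NAME_(name, suffix)	name##_##suffix
#define NBTS_MAKE_NAME(name, suffix)	NBTS_MAKE_NAME_(name, suffix)

/*
 * One "specializable" function body; ATT_OFFSET is the only part that
 * differs between key shapes.
 */
#define DEFINE_FETCH_ATT3(suffix, ATT_OFFSET) \
	static int \
	NBTS_MAKE_NAME(fetch_att3, suffix)(const int *tup) \
	{ \
		return tup[ATT_OFFSET]; \
	}

DEFINE_FETCH_ATT3(cached, 2)	/* offset known up front (attcacheoff-like) */
DEFINE_FETCH_ATT3(uncached, tup[0] + tup[1])	/* derive it from earlier attrs */

/* Pick the variant once per index, in the spirit of nbts_prep_ctx(). */
typedef int (*fetch_att3_fn) (const int *tup);

static fetch_att3_fn
choose_variant(int offsets_cacheable)
{
	return offsets_cacheable ? fetch_att3_cached : fetch_att3_uncached;
}

int
main(void)
{
	int			tup[] = {1, 1, 42, 7};

	printf("%d %d\n", choose_variant(1) (tup), choose_variant(0) (tup));
	return 0;
}

Both variants return the same answer; the point is that each specialization compiles down to the cheapest attribute-access strategy for its shape, which is what the real macros aim for.
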
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
new file mode 100644
index 0000000000..37cc3647d3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -0,0 +1,1087 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsearch_spec.c
+ * Index shape-specialized functions for nbtsearch.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsearch_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_binsrch NBTS_FUNCTION(_bt_binsrch)
+#define _bt_readpage NBTS_FUNCTION(_bt_readpage)
+
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
+static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ */
+BTStack
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+ Snapshot snapshot)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
+ page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is lowest non-leaf page level prior to leaves. So, if
+ * we're on the level 1 and asked to lock leaf page in write mode,
+ * then lock next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ highkeycmpcol = 1;
+
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+ snapshot, &highkeycmpcol, (char *) tupdatabuf);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split
+ * is completed. 'stack' is only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ */
+Buffer
+_bt_moveright(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
+ break;
+ }
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ /*
+ * tupdatabuf is filled with the right separator of the parent node.
+ * This allows us to do a binary equality check between the parent
+ * node's right separator (which is < key) and this page's P_HIKEY.
+ * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+ * Without this, we would need on average 3 full key compares per page
+ * before we achieve full dynamic prefix bounds, but with this
+ * optimization that is only 2.
+ *
+ * 3 compares: 1 for the highkey (rightmost), plus on average 2 before
+ * the binary search on the page first moves right; this average equals
+ * SUM((1/2)^x) for x from 0 to log2(n items), which tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ }
+ else
+ *comparecol = 1;
+ }
+ else
+ *comparecol = 1;
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
+ {
+ *comparecol = 1;
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ {
+ *comparecol = cmpcol;
+ break;
+ }
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
+
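
Restating the optimization described in the loop above as standalone code may help when reviewing: when the child's high key is byte-for-byte identical to the parent's right separator that was already compared against the scan key, the cached comparison result still applies and the full key compare can be skipped. A minimal, hypothetical sketch (the struct and function names are invented; the real code works on IndexTupleData as shown above):

/* highkey_reuse_sketch.c -- toy model of skipping a repeated key compare */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct FakeTuple
{
	unsigned short size;		/* total size, header included */
	char		data[16];		/* key payload */
} FakeTuple;

/*
 * True when 'hikey' is byte-identical to 'separator' (ignoring the header),
 * in which case a comparison result cached for 'separator' also holds for
 * 'hikey' and the full key compare can be skipped.
 */
static int
can_reuse_compare(const FakeTuple *hikey, const FakeTuple *separator)
{
	size_t		hdr = offsetof(FakeTuple, data);

	if (hikey->size != separator->size)
		return 0;
	return memcmp(hikey->data, separator->data, hikey->size - hdr) == 0;
}

int
main(void)
{
	FakeTuple	sep = {.size = 10, .data = "abcdefgh"};
	FakeTuple	hk = {.size = 10, .data = "abcdefgh"};

	printf("reuse previous compare: %s\n",
		   can_reuse_compare(&hk, &sep) ? "yes" : "no");
	return 0;
}
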
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
+ */
+static OffsetNumber
+_bt_binsrch(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+ }
+ }
+
+ *highkeycmpcol = highcmpcol;
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
+
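
The cmpcol/lowcmpcol/highcmpcol bookkeeping above is the dynamic prefix bound: because keys on a page are ordered, once both the lower and the upper search bound are known to match the scan key on their first k attributes, every tuple between them must match on those attributes too, so each comparison can start at attribute k+1. A self-contained sketch of the same idea over a sorted array of two-column rows (the row layout and helper names here are invented for the example, not taken from the patch):

/* prefix_binsrch_sketch.c -- binary search that skips already-equal columns */
#include <stdio.h>

#define NCOLS 2

typedef struct Row
{
	int			col[NCOLS];
} Row;

/*
 * Compare 'key' against 'row', starting at column *startcol (1-based).
 * On return, *startcol is the first column that differed, or NCOLS + 1
 * if all columns compared equal.
 */
static int
compare_from(const int *key, const Row *row, int *startcol)
{
	for (int c = *startcol; c <= NCOLS; c++)
	{
		if (key[c - 1] != row->col[c - 1])
		{
			*startcol = c;
			return (key[c - 1] < row->col[c - 1]) ? -1 : 1;
		}
	}
	*startcol = NCOLS + 1;
	return 0;
}

/* Return index of first row >= key, carrying per-bound column state. */
static int
search(const Row *rows, int nrows, const int *key)
{
	int			low = 0,
				high = nrows,
				lowcol = 1,
				highcol = 1;

	while (high > low)
	{
		int			mid = low + (high - low) / 2;
		/* safe to skip the columns both bounds already agree on */
		int			cmpcol = (lowcol < highcol) ? lowcol : highcol;
		int			r = compare_from(key, &rows[mid], &cmpcol);

		if (r > 0)
		{
			low = mid + 1;
			lowcol = cmpcol;
		}
		else
		{
			high = mid;
			highcol = cmpcol;
		}
	}
	return low;
}

int
main(void)
{
	Row			rows[] = {{{1, 1}}, {{1, 3}}, {{1, 5}}, {{2, 2}}, {{2, 9}}};
	int			key[NCOLS] = {1, 4};

	printf("first row >= key at index %d\n", search(rows, 5, key));
	return 0;
}
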
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+ AttrNumber lowcmpcol = 1;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
+
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+_bt_compare(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ {
+ *comparecol = i;
+ return result;
+ }
+
+ scankey++;
+ }
+
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
+
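
As a compact reference for the per-attribute result rules implemented in the loop above (NULL ordering, then the sign flip of the support function's "attribute vs. key" result on ASC columns), here is a standalone sketch; the flag constants and function name are simplified stand-ins invented for the example, not the SK_* flags used by the real code:

/* compare_sketch.c -- toy model of the per-attribute comparison rules */
#include <stdio.h>

#define KEY_ISNULL		0x01	/* scankey argument is NULL */
#define COL_NULLS_FIRST	0x02	/* column declared NULLS FIRST */
#define COL_DESC		0x04	/* column declared DESC */

/*
 * Return <0 / 0 / >0 for "scankey vs. index attribute", given a support
 * function result computed as "index attribute vs. scankey argument".
 */
static int
one_attribute_result(int flags, int attr_is_null, int support_result)
{
	if (flags & KEY_ISNULL)
	{
		if (attr_is_null)
			return 0;			/* NULL "=" NULL */
		return (flags & COL_NULLS_FIRST) ? -1 : 1;
	}
	if (attr_is_null)			/* non-NULL key vs. NULL attribute */
		return (flags & COL_NULLS_FIRST) ? 1 : -1;

	/*
	 * support_result is "attribute cmp key"; flip it to get "key cmp
	 * attribute", except on DESC columns where the stored order is
	 * already reversed.
	 */
	return (flags & COL_DESC) ? support_result : -support_result;
}

int
main(void)
{
	/* ASC column, attribute 5 vs. key 7: support says 5 < 7, i.e. -1 */
	printf("%d\n", one_attribute_result(0, 0, -1));		/* key > attr: 1 */
	/* DESC column, same datums: no flip */
	printf("%d\n", one_attribute_result(COL_DESC, 0, -1));
	return 0;
}
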
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* allow next page be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * is safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 67b7b1710c..af408f704f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,8 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.c"
+#include "access/nbtree_spec.h"
/*
* btbuild() -- build a new btree index.
@@ -544,6 +544,7 @@ static void
_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
{
BTWriteState wstate;
+ nbts_prep_ctx(btspool->index);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
@@ -844,6 +845,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
Size pgspc;
Size itupsz;
bool isleaf;
+ nbts_prep_ctx(wstate->index);
/*
* This is a handy place to check for cancel interrupts during the btree
@@ -1176,264 +1178,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- Assert(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
new file mode 100644
index 0000000000..368d6f244c
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsort_spec.c
+ * Index shape-specialized functions for nbtsort.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsort_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_load NBTS_FUNCTION(_bt_load)
+
+static void _bt_load(BTWriteState *wstate,
+ BTSpool *btspool, BTSpool *btspool2);
+
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ Assert(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
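
The merge branch of _bt_load above interleaves two already-sorted streams, comparing key columns first and falling back to the heap TID as an implicit last key attribute. A self-contained sketch of that merge rule over two sorted arrays (the Tup type and names are invented for the example; a single int stands in for both the key columns and the TID):

/* spool_merge_sketch.c -- merge two sorted runs with a TID tiebreak */
#include <stdio.h>

typedef struct Tup
{
	int			key;			/* single key column */
	int			tid;			/* stands in for the heap TID */
} Tup;

/* Compare key column first, then TID; TIDs are assumed never equal. */
static int
tup_cmp(const Tup *a, const Tup *b)
{
	if (a->key != b->key)
		return (a->key < b->key) ? -1 : 1;
	return (a->tid < b->tid) ? -1 : 1;
}

static void
merge(const Tup *run1, int n1, const Tup *run2, int n2)
{
	int			i = 0,
				j = 0;

	while (i < n1 || j < n2)
	{
		int			load1;

		if (j >= n2)
			load1 = 1;
		else if (i >= n1)
			load1 = 0;
		else
			load1 = tup_cmp(&run1[i], &run2[j]) <= 0;

		if (load1)
		{
			printf("(%d,%d) ", run1[i].key, run1[i].tid);
			i++;
		}
		else
		{
			printf("(%d,%d) ", run2[j].key, run2[j].tid);
			j++;
		}
	}
	printf("\n");
}

int
main(void)
{
	Tup			run1[] = {{1, 4}, {2, 1}, {3, 7}};
	Tup			run2[] = {{1, 2}, {2, 5}};

	merge(run1, 3, run2, 2);	/* prints (1,2) (1,4) (2,1) (2,5) (3,7) */
	return 0;
}
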
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index ecb49bb471..991118fd50 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -639,6 +639,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
ItemId itemid;
IndexTuple tup;
int keepnatts;
+ nbts_prep_ctx(state->rel);
Assert(state->is_leaf && !state->is_rightmost);
@@ -945,6 +946,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
*rightinterval;
int perfectpenalty;
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+ nbts_prep_ctx(state->rel);
/* Assume that alternative strategy won't be used for now */
*strategy = SPLIT_DEFAULT;
@@ -1137,6 +1139,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
{
IndexTuple lastleft;
IndexTuple firstright;
+ nbts_prep_ctx(state->rel);
if (!state->is_leaf)
{
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 8003583c0a..85f92adda8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,10 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.c"
+#include "access/nbtree_spec.h"
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1220,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2173,286 +1703,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
new file mode 100644
index 0000000000..0288da22d6
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -0,0 +1,775 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtutils_spec.c
+ * Index shape-specialized functions for nbtutils.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtutils_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_check_rowcompare NBTS_FUNCTION(_bt_check_rowcompare)
+#define _bt_keep_natts NBTS_FUNCTION(_bt_keep_natts)
+
+static bool _bt_check_rowcompare(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+ continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+ TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb6cfcfd00..12f909e1cf 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -57,8 +57,6 @@ static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
SortTuple *stup);
static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
LogicalTape *tape, unsigned int tuplen);
-static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
@@ -130,6 +128,9 @@ typedef struct
int datumTypeLen;
} TuplesortDatumArg;
+#define NBT_SPECIALIZE_FILE "../../backend/utils/sort/tuplesortvariants_spec.c"
+#include "access/nbtree_spec.h"
+
Tuplesortstate *
tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
@@ -217,6 +218,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
MemoryContext oldcontext;
TuplesortClusterArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
@@ -328,6 +330,7 @@ tuplesort_begin_index_btree(Relation heapRel,
TuplesortIndexBTreeArg *arg;
MemoryContext oldcontext;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -461,6 +464,7 @@ tuplesort_begin_index_gist(Relation heapRel,
MemoryContext oldcontext;
TuplesortIndexBTreeArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -1259,142 +1263,6 @@ removeabbrev_index(Tuplesortstate *state, SortTuple *stups, int count)
}
}
-static int
-comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- /*
- * This is similar to comparetup_heap(), but expects index tuples. There
- * is also special handling for enforcing uniqueness, and special
- * treatment for equal keys at the end.
- */
- TuplesortPublic *base = TuplesortstateGetPublic(state);
- TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
- SortSupport sortKey = base->sortKeys;
- IndexTuple tuple1;
- IndexTuple tuple2;
- int keysz;
- TupleDesc tupDes;
- bool equal_hasnull = false;
- int nkey;
- int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
-
- /* Compare the leading sort key */
- compare = ApplySortComparator(a->datum1, a->isnull1,
- b->datum1, b->isnull1,
- sortKey);
- if (compare != 0)
- return compare;
-
- /* Compare additional sort keys */
- tuple1 = (IndexTuple) a->tuple;
- tuple2 = (IndexTuple) b->tuple;
- keysz = base->nKeys;
- tupDes = RelationGetDescr(arg->index.indexRel);
-
- if (sortKey->abbrev_converter)
- {
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
- }
-
- /* they are equal, so we only need to examine one null flag */
- if (a->isnull1)
- equal_hasnull = true;
-
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
- {
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
-
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare; /* done when we find unequal attributes */
-
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
- equal_hasnull = true;
- }
-
- /*
- * If btree has asked us to enforce uniqueness, complain if two equal
- * tuples are detected (unless there was at least one NULL field and NULLS
- * NOT DISTINCT was not set).
- *
- * It is sufficient to make the test here, because if two tuples are equal
- * they *must* get compared at some stage of the sort --- otherwise the
- * sort algorithm wouldn't have checked whether one must appear before the
- * other.
- */
- if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- /*
- * Some rather brain-dead implementations of qsort (such as the one in
- * QNX 4) will sometimes call the comparison routine to compare a
- * value to itself, but we always use our own implementation, which
- * does not.
- */
- Assert(tuple1 != tuple2);
-
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is required for
- * btree indexes, since heap TID is treated as an implicit last key
- * attribute in order to ensure that all keys in the index are physically
- * unique.
- */
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
- /* ItemPointer values should never be equal */
- Assert(false);
-
- return 0;
-}
-
static int
comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state)
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
new file mode 100644
index 0000000000..0791f41136
--- /dev/null
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -0,0 +1,158 @@
+/*-------------------------------------------------------------------------
+ *
+ * tuplesortvariants_spec.c
+ * Index shape-specialized functions for tuplesortvariants.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/utils/sort/tuplesortvariants_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define comparetup_index_btree NBTS_FUNCTION(comparetup_index_btree)
+
+static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+
+static int
+comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ /*
+ * This is similar to comparetup_heap(), but expects index tuples. There
+ * is also special handling for enforcing uniqueness, and special
+ * treatment for equal keys at the end.
+ */
+ TuplesortPublic *base = TuplesortstateGetPublic(state);
+ TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
+ SortSupport sortKey = base->sortKeys;
+ IndexTuple tuple1;
+ IndexTuple tuple2;
+ int keysz;
+ TupleDesc tupDes;
+ bool equal_hasnull = false;
+ int nkey;
+ int32 compare;
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+
+ /* Compare the leading sort key */
+ compare = ApplySortComparator(a->datum1, a->isnull1,
+ b->datum1, b->isnull1,
+ sortKey);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ tuple1 = (IndexTuple) a->tuple;
+ tuple2 = (IndexTuple) b->tuple;
+ keysz = base->nKeys;
+ tupDes = RelationGetDescr(arg->index.indexRel);
+
+ if (sortKey->abbrev_converter)
+ {
+ datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
+
+ compare = ApplySortAbbrevFullComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare;
+ }
+
+ /* they are equal, so we only need to examine one null flag */
+ if (a->isnull1)
+ equal_hasnull = true;
+
+ sortKey++;
+ for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ {
+ datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+
+ compare = ApplySortComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare; /* done when we find unequal attributes */
+
+ /* they are equal, so we only need to examine one null flag */
+ if (isnull1)
+ equal_hasnull = true;
+ }
+
+ /*
+ * If btree has asked us to enforce uniqueness, complain if two equal
+ * tuples are detected (unless there was at least one NULL field and NULLS
+ * NOT DISTINCT was not set).
+ *
+ * It is sufficient to make the test here, because if two tuples are equal
+ * they *must* get compared at some stage of the sort --- otherwise the
+ * sort algorithm wouldn't have checked whether one must appear before the
+ * other.
+ */
+ if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ /*
+ * Some rather brain-dead implementations of qsort (such as the one in
+ * QNX 4) will sometimes call the comparison routine to compare a
+ * value to itself, but we always use our own implementation, which
+ * does not.
+ */
+ Assert(tuple1 != tuple2);
+
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is required for
+ * btree indexes, since heap TID is treated as an implicit last key
+ * attribute in order to ensure that all keys in the index are physically
+ * unique.
+ */
+ {
+ BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
+ BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
+
+ if (blk1 != blk2)
+ return (blk1 < blk2) ? -1 : 1;
+ }
+ {
+ OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
+ OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
+
+ if (pos1 != pos2)
+ return (pos1 < pos2) ? -1 : 1;
+ }
+
+ /* ItemPointer values should never be equal */
+ Assert(false);
+
+ return 0;
+}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4cb24fa005..f3f0961052 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1122,15 +1122,27 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+typedef enum NBTS_CTX {
+ NBTS_CTX_CACHED,
+ NBTS_CTX_DEFAULT, /* fallback */
+} NBTS_CTX;
+
+static inline NBTS_CTX _nbt_spec_context(Relation irel)
+{
+ if (!PointerIsValid(irel))
+ return NBTS_CTX_DEFAULT;
+
+ return NBTS_CTX_CACHED;
+}
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
+#include "nbtree_spec.h"
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1161,9 +1173,6 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
- IndexTuple newitem, Size newitemsz,
- bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1179,9 +1188,6 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
@@ -1229,16 +1235,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
- int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access,
- Snapshot snapshot, AttrNumber *comparecol,
- char *tupdatabuf);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
- OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -1247,7 +1243,6 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1255,8 +1250,6 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1269,10 +1262,6 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
new file mode 100644
index 0000000000..0bfb623f37
--- /dev/null
+++ b/src/include/access/nbtree_spec.h
@@ -0,0 +1,180 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.h
+ * header file for postgres btree access method implementation.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_spec.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ * - nbts_context(irel)
+ * Constructs a context that is used to call specialized functions.
+ * Note that this is optional in paths that are inaccessible to unspecialized
+ * code, but should be included in NBTS_BUILD_GENERIC.
+ */
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+#define NBTS_CTX_NAME __nbts_ctx
+
+/* contextual specializations */
+#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
+#define NBTS_SPECIALIZE_NAME(name) ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
+)
+
+/* how do we make names? */
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
+
+#define nbt_opt_specialize(rel) \
+do { \
+ Assert(PointerIsValid(rel)); \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
+} while (false)
+
+/*
+ * Access a specialized nbtree function, based on the shape of the index key.
+ */
+#define NBTS_DEFINITIONS
+
+/*
+ * Call a potentially specialized function for a given btree operation.
+ *
+ * NB: the rel argument is evaluated multiple times.
+ */
+#ifdef NBTS_FUNCTION
+#undef NBTS_FUNCTION
+#endif
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+/* While specializing, the context is the local context */
+#ifdef nbts_prep_ctx
+#undef nbts_prep_ctx
+#endif
+#define nbts_prep_ctx(rel)
+
+/*
+ * Specialization 1: CACHED
+ *
+ * Multiple key columns, optimized access for attcacheoff -cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_SPECIALIZING_CACHED
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * Specialization 2: DEFAULT
+ *
+ * "Default", externally accessible, not so much optimized functions
+ */
+
+/* the default context (and later contexts) do need to specialize, so here's that */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * From here on, all NBTS_FUNCTION uses refer to specialized functions that
+ * are being called. Change the result of that macro from a direct call
+ * to a conditional call to the right specialization, depending on the
+ * context.
+ */
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) NBTS_SPECIALIZE_NAME(name)
+
+#undef NBT_SPECIALIZE_FILE
diff --git a/src/include/access/nbtree_specfuncs.h b/src/include/access/nbtree_specfuncs.h
new file mode 100644
index 0000000000..ac60319eff
--- /dev/null
+++ b/src/include/access/nbtree_specfuncs.h
@@ -0,0 +1,66 @@
+/*
+ * prototypes for functions that are included in nbtree.h
+ */
+
+#define _bt_specialize NBTS_FUNCTION(_bt_specialize)
+#define btinsert NBTS_FUNCTION(btinsert)
+#define _bt_dedup_pass NBTS_FUNCTION(_bt_dedup_pass)
+#define _bt_doinsert NBTS_FUNCTION(_bt_doinsert)
+#define _bt_search NBTS_FUNCTION(_bt_search)
+#define _bt_moveright NBTS_FUNCTION(_bt_moveright)
+#define _bt_binsrch_insert NBTS_FUNCTION(_bt_binsrch_insert)
+#define _bt_compare NBTS_FUNCTION(_bt_compare)
+#define _bt_mkscankey NBTS_FUNCTION(_bt_mkscankey)
+#define _bt_checkkeys NBTS_FUNCTION(_bt_checkkeys)
+#define _bt_truncate NBTS_FUNCTION(_bt_truncate)
+#define _bt_keep_natts_fast NBTS_FUNCTION(_bt_keep_natts_fast)
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void _bt_specialize(Relation rel);
+
+extern bool btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool _bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+ int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot, AttrNumber *comparecol,
+ char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
+extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
--
2.39.0
v9-0006-btree-specialization-for-variable-length-multi-at.patch (application/octet-stream)
From ea88f2340887d7ae984b0a96664f7667bc8db50a Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 13 Jan 2023 15:42:41 +0100
Subject: [PATCH v9 6/6] btree specialization for variable-length
multi-attribute keys
The default code path is relatively slow at O(n^2), so with multiple
attributes we accept the increased startup cost in favour of lower
costs for later attributes.
Note that this will only be used for indexes that have at least one
variable-length key attribute (except when that is the last key attribute,
in specific cases).
---
src/backend/access/nbtree/README | 10 +-
src/backend/access/nbtree/nbtree_spec.c | 3 +
src/include/access/itup_attiter.h | 199 ++++++++++++++++++++++++
src/include/access/nbtree.h | 11 +-
src/include/access/nbtree_spec.h | 48 +++++-
5 files changed, 260 insertions(+), 11 deletions(-)
create mode 100644 src/include/access/itup_attiter.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6864902637..2219c58242 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1104,15 +1104,13 @@ in the index AM to call the specialized functions, increasing the
performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
- - indexes with only a single key attribute
- - multi-column indexes that could benefit from the attcacheoff optimization
+ - indexes with only a single key attribute,
+ - multi-column indexes that cannot pre-calculate the offsets of all key
+ attributes in the tuple data section,
+ - multi-column indexes that do benefit from the attcacheoff optimization
NB: This is also the default path, and is comparatively slow for uncachable
attribute offsets.
-Future work will optimize for multi-column indexes that don't benefit
-from the attcacheoff optimization by improving on the O(n^2) nature of
-index_getattr through storing attribute offsets.
-
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 21635397ed..699197dfa7 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_UNCACHED:
+ _bt_specialize_uncached(rel);
+ break;
case NBTS_CTX_SINGLE_KEYATT:
_bt_specialize_single_keyatt(rel);
break;
diff --git a/src/include/access/itup_attiter.h b/src/include/access/itup_attiter.h
new file mode 100644
index 0000000000..c8fb6954bc
--- /dev/null
+++ b/src/include/access/itup_attiter.h
@@ -0,0 +1,199 @@
+/*-------------------------------------------------------------------------
+ *
+ * itup_attiter.h
+ * POSTGRES index tuple attribute iterator definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/itup_attiter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ITUP_ATTITER_H
+#define ITUP_ATTITER_H
+
+#include "access/itup.h"
+#include "varatt.h"
+
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupdesc - the tuple description
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* Offset of attribute 1 is always 0 */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ false, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false);
+
+/*
+ * Initiate an index attribute iterator to attribute attnum,
+ * and return the corresponding datum.
+ *
+ * This is nearly the same as index_deform_tuple, except that this
+ * returns the internal state up to attnum, instead of populating the
+ * datum- and isnull-arrays
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+/* ----------------
+ * index_attiternext() - get the next attribute of an index tuple
+ *
+ * This gets called many times, so we do the least amount of work
+ * possible.
+ *
+ * The code does not attempt to update attcacheoff, as it is unlikely
+ * to reach a situation where the cached offset matters much.
+ * If the cached offsets do matter, the caller should make sure that
+ * PopulateTupleDescCacheOffsets() was called on the tuple descriptor
+ * to populate the attribute offset cache.
+ *
+ * ----------------
+ */
+static inline Datum
+index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true;
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
+#endif /* ITUP_ATTITER_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4628c41e9a..d5ed38bb71 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -16,6 +16,7 @@
#include "access/amapi.h"
#include "access/itup.h"
+#include "access/itup_attiter.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/xlogreader.h"
@@ -1124,18 +1125,26 @@ typedef struct BTOptions
typedef enum NBTS_CTX {
NBTS_CTX_SINGLE_KEYATT,
+ NBTS_CTX_UNCACHED,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
static inline NBTS_CTX _nbt_spec_context(Relation irel)
{
+ AttrNumber nKeyAtts;
+
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
- if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ nKeyAtts = IndexRelationGetNumberOfKeyAttributes(irel);
+
+ if (nKeyAtts == 1)
return NBTS_CTX_SINGLE_KEYATT;
+ if (TupleDescAttr(irel->rd_att, nKeyAtts - 1)->attcacheoff < -1)
+ return NBTS_CTX_UNCACHED;
+
return NBTS_CTX_CACHED;
}
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index 3ad64aad39..a57d69f588 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -44,6 +44,7 @@
* Macros used in the nbtree specialization code.
*/
#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -52,8 +53,10 @@
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
(NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_UNCACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_UNCACHED)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
) \
)
@@ -68,9 +71,12 @@ do { \
Assert(PointerIsValid(rel)); \
if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
{ \
- nbts_prep_ctx(rel); \
Assert(PointerIsValid(rel)); \
- _bt_specialize(rel); \
+ PopulateTupleDescCacheOffsets(rel->rd_att); \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
} \
} while (false)
@@ -216,6 +222,40 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but attcacheoff -optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit((itup), (initAttNum), (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* All subsequent contexts are from non-templated code, so
* they need to actually include the context.
--
2.39.0
On Mon, 23 Jan 2023 at 14:54, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Fri, 20 Jan 2023 at 20:37, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Thu, 12 Jan 2023 at 16:11, David Christensen <david@pgguru.net> wrote:
Last I saw on the thread you were going to see if the specialization was required or not.
Thank you for your interest, and sorry for the delayed response. I've
been working on rebasing and polishing the patches, and hit some
issues benchmarking the set. Attached in Perf_results.xlsx are the
results of my benchmarks, and a new rebased patchset.
Attached is v10, which fixes one compile warning, and fixes
headerscheck/cpluspluscheck by adding nbtree_spec.h and
nbtree_specfuncs.h to the ignored header files. It also fixes some cases
where later patches modified earlier patches' code when the change
should have been incorporated into the earlier patch instead.
I think this is ready for review.
The top-level design of the patchset:
0001 modifies the btree descent code to use dynamic prefix compression,
i.e. it skips comparing a column during the binary search when that
column has already compared as equal on the tuples both to the left and
to the right of the tuple being compared.
It also includes an optimized path for when the downlink tuple's right
neighbor's data is bytewise equal to the highkey of the page we
descended onto - in those cases we don't need to run _bt_compare on
that index tuple, as we know the result will be the same as that of
the downlink tuple, i.e. it compares as "less than".
NOTE that this patch, when applied stand-alone, adds overhead for all
indexes, with its benefits limited to non-unique indexes
or indexes where uniqueness is only guaranteed by later
attributes. Later patches in the patchset return performance to a
similar level as before 0001 for the impacted indexes.
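To illustrate the invariant that 0001 relies on outside of any nbtree
specifics, here is a minimal self-contained sketch in plain C over an
array of integer rows (all names here are invented for the example, this
is not the patch code): once both search bounds are known to be equal to
the key on some leading columns, every row between them must be equal on
those columns too, so the comparison can start past them.

#include <assert.h>
#include <stdio.h>

#define NCOLS 3

typedef struct Row { int col[NCOLS]; } Row;

/*
 * Compare key against row, starting at column *cmpcol (0-based); columns
 * before *cmpcol are already known to be equal.  On return, *cmpcol is the
 * column where the first difference was found (NCOLS when fully equal).
 */
static int
row_compare(const int *key, const Row *row, int *cmpcol)
{
	for (int c = *cmpcol; c < NCOLS; c++)
	{
		if (key[c] != row->col[c])
		{
			*cmpcol = c;
			return (key[c] < row->col[c]) ? -1 : 1;
		}
	}
	*cmpcol = NCOLS;
	return 0;
}

/* Return the index of the first row that sorts >= key. */
static int
search(const Row *rows, int nrows, const int *key)
{
	int		low = 0,
			high = nrows;
	int		low_eqcols = 0;		/* cols known equal on the lower bound (0: nothing proven) */
	int		high_eqcols = 0;	/* cols known equal on the upper bound (0: nothing proven) */

	while (low < high)
	{
		int		mid = low + (high - low) / 2;

		/*
		 * Rows are sorted, so every row between the bounds shares at least
		 * min(low_eqcols, high_eqcols) leading columns with the key: start
		 * comparing at that column instead of at column 0.
		 */
		int		cmpcol = (low_eqcols < high_eqcols) ? low_eqcols : high_eqcols;
		int		cmp = row_compare(key, &rows[mid], &cmpcol);

		if (cmp > 0)
		{
			low = mid + 1;
			low_eqcols = cmpcol;
		}
		else
		{
			high = mid;
			high_eqcols = cmpcol;
		}
	}
	return low;
}

int
main(void)
{
	Row		rows[] = {{{1, 1, 1}}, {{1, 1, 2}}, {{1, 2, 1}}, {{1, 2, 2}}, {{2, 1, 1}}};
	int		key[NCOLS] = {1, 2, 1};

	assert(search(rows, 5, key) == 2);
	printf("first row >= key is at index %d\n", search(rows, 5, key));
	return 0;
}

The real patch tracks the same information as an AttrNumber passed to
_bt_compare (the comparecol / highcmpcol arguments visible in the diffs),
and additionally has to re-establish it per page because of concurrent
page splits and deletions.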
0002 creates a scaffold for specializing nbtree functions, and moves
the functions I selected for specialization into separate files. Those
separate files (suffixed with _spec) are pulled back into the original
files by including the nbtree specialization header with a macro
variable. The code itself has not materially changed at this point; a
minimal sketch of the mechanism follows below.
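In case the mechanism of 0002 is easier to see in miniature: here is a
two-file sketch of the same trick with invented names (the real thing
uses nbtree_spec.h, NBT_SPECIALIZE_FILE and NBTS_MAKE_NAME). The template
is textually included once per key shape, each inclusion pastes a
different suffix onto the function names, and callers pick the right
instantiation at runtime.

/* ---- spec_template.h: the template; no include guard on purpose ---- */
/* Expects MAKE_NAME, SPEC_SUFFIX and SPEC_STEP to be defined by the includer. */
static int
MAKE_NAME(sum_atts, SPEC_SUFFIX)(int natts)
{
	int		sum = 0;

	for (int i = 0; i < natts; i = SPEC_STEP(i))
		sum += i;
	return sum;
}

/* ---- main.c: instantiate the template twice, then dispatch ---- */
#include <stdio.h>

#define MAKE_NAME_(a, b) a##_##b
#define MAKE_NAME(a, b)  MAKE_NAME_(a, b)

#define SPEC_SUFFIX cached
#define SPEC_STEP(x) ((x) + 1)		/* stand-in for cheap, cached attribute access */
#include "spec_template.h"
#undef SPEC_SUFFIX
#undef SPEC_STEP

#define SPEC_SUFFIX uncached
#define SPEC_STEP(x) ((x) + 2)		/* stand-in for the slower iterator */
#include "spec_template.h"
#undef SPEC_SUFFIX
#undef SPEC_STEP

int
main(void)
{
	int		offsets_cacheable = 1;	/* pretend this was derived from the index's key shape */
	int		(*fn) (int) = offsets_cacheable ? sum_atts_cached : sum_atts_uncached;

	printf("%d\n", fn(10));
	return 0;
}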
0003 updates the functions selected in 0002 to use the
specializable attribute iterator macros instead of manual attribute
iteration.
Then, 0004 adds specialization for single-key-attribute indexes,
0005 adds a helper function for populating attcacheoff (which is
separately useful too, but essential in this patchset; see the sketch
below), and
0006 adds specialization for multi-column indexes for which the offset
of the last key column cannot be cached.
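To make the attcacheoff encoding behind 0005 concrete, here is a tiny
self-contained check of the relative-offset arithmetic (illustrative
only, not the patch code; the attribute numbering convention is the
sketch's own): an uncacheable attribute stores a value below -1 that
encodes the distance back to the last attribute whose offset is cached,
so that attribute can be recovered in O(1).

#include <assert.h>

/* sentinel for an uncacheable attribute: -1 - (thisattno - cachedattno) */
static int
encode_uncacheable(int thisattno, int cachedattno)
{
	return -1 - (thisattno - cachedattno);
}

/* recover the last cached attribute number: attcacheoff + 1 + thisattno */
static int
last_cached_attno(int attcacheoff, int thisattno)
{
	return attcacheoff + 1 + thisattno;
}

int
main(void)
{
	/* say attribute 5 is uncacheable and attribute 3 was the last cacheable one */
	int		off = encode_uncacheable(5, 3);

	assert(off == -3);					/* below -1, so "uncacheable" rather than "unknown" */
	assert(last_cached_attno(off, 5) == 3);		/* O(1) recovery */
	return 0;
}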
Kind regards,
Matthias van de Meent.
Attachments:
v10-0004-Optimize-nbts_attiter-for-nkeyatts-1-btrees.patch (application/octet-stream)
From e8d891adb0f4bdcf091edc5e02155303503ba603 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 20:04:56 +0100
Subject: [PATCH v10 4/6] Optimize nbts_attiter for nkeyatts==1 btrees
This removes the index_getattr_nocache call path, which has significant overhead, and instead uses a constant offset of 0.
---
src/backend/access/nbtree/README | 1 +
src/backend/access/nbtree/nbtree_spec.c | 3 ++
src/include/access/nbtree.h | 35 ++++++++++++++++
src/include/access/nbtree_spec.h | 56 ++++++++++++++++++++++++-
4 files changed, 93 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 4b11ea9ad7..6864902637 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1104,6 +1104,7 @@ in the index AM to call the specialized functions, increasing the
performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
+ - indexes with only a single key attribute
- multi-column indexes that could benefit from the attcacheoff optimization
NB: This is also the default path, and is comparatively slow for uncachable
attribute offsets.
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 6b766581ab..21635397ed 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_SINGLE_KEYATT:
+ _bt_specialize_single_keyatt(rel);
+ break;
case NBTS_CTX_DEFAULT:
break;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f3f0961052..4628c41e9a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1123,6 +1123,7 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
typedef enum NBTS_CTX {
+ NBTS_CTX_SINGLE_KEYATT,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
@@ -1132,9 +1133,43 @@ static inline NBTS_CTX _nbt_spec_context(Relation irel)
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
+ if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ return NBTS_CTX_SINGLE_KEYATT;
+
return NBTS_CTX_CACHED;
}
+static inline Datum _bt_getfirstatt(IndexTuple tuple, TupleDesc tupleDesc,
+ bool *isNull)
+{
+ Datum result;
+ if (IndexTupleHasNulls(tuple))
+ {
+ if (att_isnull(0, (bits8 *)(tuple) + sizeof(IndexTupleData)))
+ {
+ *isNull = true;
+ result = (Datum) 0;
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)
+ + sizeof(IndexAttributeBitMapData)));
+ }
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)));
+ }
+
+ return result;
+}
+
#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
#include "nbtree_spec.h"
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index fa38b09c6e..8e476c300d 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -44,6 +44,7 @@
/*
* Macros used in the nbtree specialization code.
*/
+#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -51,8 +52,10 @@
/* contextual specializations */
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
)
@@ -164,6 +167,55 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Specialization 3: SINGLE_KEYATT
+ *
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path cannot be used for indexes with multiple key
+ * columns, because it never considers the next column.
+ */
+
+/* the default context (and later contexts) do need to specialize, so here's that */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel)
+
+#define NBTS_SPECIALIZING_SINGLE_KEYATT
+#define NBTS_TYPE NBTS_TYPE_SINGLE_KEYATT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert((endAttNum) == 1); ((void) (endAttNum)); \
+ if ((initAttNum) == 1) for (int spec_i = 0; spec_i < 1; spec_i++)
+
+#define nbts_attiter_attnum 1
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+( \
+ AssertMacro(spec_i == 0), \
+ _bt_getfirstatt(itup, tupDesc, &NBTS_MAKE_NAME(itup, isNull)) \
+)
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_KEYATT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* All next uses of nbts_prep_ctx are in non-templated code, so here we make
* sure we actually create the context.
--
2.39.0
v10-0005-Add-an-attcacheoff-populating-function.patch (application/octet-stream)
From 5a0733796abf607c44f366f066fe6bf99c3118b5 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 12 Jan 2023 21:34:36 +0100
Subject: [PATCH v10 5/6] Add an attcacheoff-populating function
It populates attcacheoff-capable attributes with the correct offset,
and fills attributes whose offset is uncacheable with an 'uncacheable'
indicator value, as opposed to -1, which signals "unknown".
This allows users of the API to remove redundant cycles that try to
cache the offsets of attributes - instead of an O(N-attrs) operation,
this only requires an O(1) check.
---
src/backend/access/common/tupdesc.c | 111 ++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 113 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 72a2c3d3db..f2d80ed0db 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -919,3 +919,114 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc, returning the last
+ * attcacheoff with a valid offset value.
+ *
+ * Populates attcacheoff with a negative cache value when no offset
+ * can be calculated (due to e.g. variable length attributes).
+ * The negative value is a value relative to the last cacheable attribute
+ * attcacheoff = -1 - (thisattno - cachedattno)
+ * so that the last attribute with cached offset can be found with
+ * cachedattno = attcacheoff + 1 + thisattno
+ *
+ * The value returned is the AttrNumber of the last (1-based) attribute that
+ * had its offset cached.
+ *
+ * When the TupleDesc has 0 attributes, it returns 0.
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber currAttNo, lastCachedAttNo;
+
+ if (numberOfAttributes == 0)
+ return 0;
+
+ /* Non-negative value: this attribute is cached */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff >= 0)
+ return (AttrNumber) desc->natts;
+ /*
+ * Attribute has been filled with relative offset to last cached value, but
+ * it itself is unreachable.
+ */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ return (AttrNumber) (TupleDescAttr(desc, desc->natts - 1)->attcacheoff + 1 + desc->natts);
+
+ /* last attribute of the tupledesc may or may not support attcacheoff */
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ currAttNo = 1;
+ /*
+ * Other code may have populated the value previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff >= 0)
+ currAttNo++;
+
+ /*
+ * Cache offset is undetermined. Start calculating offsets if possible.
+ *
+ * When we exit this block, currAttNo will point at the first uncacheable
+ * attribute, or past the end of the attribute array.
+ */
+ if (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, currAttNo - 1);
+ int32 off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, currAttNo);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ {
+ att->attcacheoff = off;
+ currAttNo++;
+ }
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ currAttNo++;
+ }
+ }
+ }
+
+ Assert(currAttNo == numberOfAttributes || (
+ currAttNo < numberOfAttributes
+ && TupleDescAttr(desc, (currAttNo - 1))->attcacheoff >= 0
+ && TupleDescAttr(desc, currAttNo)->attcacheoff == -1
+ ));
+ /*
+ * No cacheable offsets left. Fill the rest with negative cache values,
+ * but return the latest cached offset.
+ */
+ lastCachedAttNo = currAttNo;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ TupleDescAttr(desc, currAttNo)->attcacheoff = -1 - (currAttNo - lastCachedAttNo);
+ currAttNo++;
+ }
+
+ return lastCachedAttNo;
+}
+}
\ No newline at end of file
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index b4286cf922..2673f2d0f3 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.39.0
v10-0003-Use-specialized-attribute-iterators-in-the-speci.patch (application/octet-stream)
From 8b848f7bc63bb3999308940593abba10208c3913 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:57:21 +0100
Subject: [PATCH v10 3/6] Use specialized attribute iterators in the
specialized source files
This is committed separately to make clear what substantial changes were
made to the pre-existing code.
Even though not all nbt*_spec functions have been updated, these functions
can now directly call (and inline, and optimize for) the specialized functions
they call, instead of having to determine the right specialization based on
the (potentially locally unavailable) index relation. This makes those
functions still worth specializing/duplicating.
---
src/backend/access/nbtree/nbtsearch_spec.c | 18 +++---
src/backend/access/nbtree/nbtsort_spec.c | 24 +++----
src/backend/access/nbtree/nbtutils_spec.c | 62 ++++++++++++-------
.../utils/sort/tuplesortvariants_spec.c | 54 +++++++++-------
4 files changed, 92 insertions(+), 66 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
index 37cc3647d3..4ce39e7724 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.c
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -632,6 +632,7 @@ _bt_compare(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -664,23 +665,26 @@ _bt_compare(Relation rel,
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, *comparecol, itupdesc);
+
+ nbts_foreachattr(*comparecol, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -709,7 +713,7 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
{
- *comparecol = i;
+ *comparecol = nbts_attiter_attnum;
return result;
}
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
index 368d6f244c..6f33cc4cc2 100644
--- a/src/backend/access/nbtree/nbtsort_spec.c
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -34,8 +34,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -57,7 +56,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -90,21 +89,24 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else if (itup != NULL)
{
int32 compare = 0;
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
index 0288da22d6..07ca18f404 100644
--- a/src/backend/access/nbtree/nbtutils_spec.c
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -64,7 +64,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
- int i;
+ nbts_attiterdeclare(itup);
itupdesc = RelationGetDescr(rel);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -95,7 +95,10 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -106,27 +109,30 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -675,6 +681,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
@@ -686,20 +694,22 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -707,6 +717,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -747,24 +758,27 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
index 0791f41136..61c4826853 100644
--- a/src/backend/utils/sort/tuplesortvariants_spec.c
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -40,11 +40,8 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
bool equal_hasnull = false;
int nkey;
int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
+ nbts_attiterdeclare(tuple1);
+ nbts_attiterdeclare(tuple2);
/* Compare the leading sort key */
compare = ApplySortComparator(a->datum1, a->isnull1,
@@ -59,37 +56,46 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
keysz = base->nKeys;
tupDes = RelationGetDescr(arg->index.indexRel);
- if (sortKey->abbrev_converter)
+ if (!sortKey->abbrev_converter)
{
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
+ nkey = 2;
+ sortKey++;
+ }
+ else
+ {
+ nkey = 1;
}
/* they are equal, so we only need to examine one null flag */
if (a->isnull1)
equal_hasnull = true;
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ nbts_attiterinit(tuple1, nkey, tupDes);
+ nbts_attiterinit(tuple2, nkey, tupDes);
+
+ nbts_foreachattr(nkey, keysz)
{
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+ Datum datum1,
+ datum2;
+ datum1 = nbts_attiter_nextattdatum(tuple1, tupDes);
+ datum2 = nbts_attiter_nextattdatum(tuple2, tupDes);
+
+ if (nbts_attiter_attnum == 1)
+ compare = ApplySortAbbrevFullComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ else
+ compare = ApplySortComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
if (compare != 0)
- return compare; /* done when we find unequal attributes */
+ return compare;
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
+ if (nbts_attiter_curattisnull(tuple1))
equal_hasnull = true;
+
+ sortKey++;
}
/*
--
2.39.0
v10-0001-Implement-dynamic-prefix-compression-in-nbtree.patch (application/octet-stream)
From 9cca2dce4f968994e989e7214366600f5d0f6c5e Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 10 Jan 2023 21:45:44 +0100
Subject: [PATCH v10 1/6] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, if some prefix of the
scan attributes on both sides of the compared tuple are equal
to the scankey, then the tuple currently being compared
must also have those prefix attributes equal to the
scankey.
We cannot generally propagate this information to _binsrch on
lower pages, as this downstream page may have concurrently split
and/or have merged with its deleted left neighbour (see [0]),
which moves the keyspace of the linked page. We thus can only
trust the current state of the page at hand for this optimization,
which means we must validate this state each time we open the page.
Although this limits the overall applicability of the
optimization, it still allows for a nice performance
improvement in most cases where the initial columns have many
duplicate values and the compare function is not cheap.
As an exception to the above rule, most of the time a page's
highkey is equal to the right separator on the parent page due to
how btree splits are done. By storing this right separator from
the parent page and then validating that the highkey of the child
page contains the exact same data, we can restore the right prefix
bound without having to call the relatively expensive _bt_compare.
In the worst-case scenario of a concurrent page split, we'd still
have to validate the full key, but that doesn't happen very often
when compared to the number of times we descend the btree.
---
contrib/amcheck/verify_nbtree.c | 17 ++--
src/backend/access/nbtree/README | 43 ++++++++++
src/backend/access/nbtree/nbtinsert.c | 34 +++++---
src/backend/access/nbtree/nbtsearch.c | 119 +++++++++++++++++++++++---
src/include/access/nbtree.h | 10 ++-
5 files changed, 189 insertions(+), 34 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 257cff671b..22bb229820 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2701,6 +2701,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2710,13 +2711,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2778,6 +2779,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2788,7 +2790,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2840,10 +2842,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2863,10 +2866,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2901,13 +2905,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index dd0f7ad2bd..16a31d2bfe 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,49 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because NBTrees have a sorted keyspace, when we have determined that some
+prefixing columns of tuples on both sides of the tuple that is being
+compared are equal to the scankey, then the current tuple must also share
+this prefix with the scankey. This allows us to skip comparing those columns,
+saving the indirect function calls in the compare operation.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so this is only useful on the page level: Concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by a parent page to the right. If we re-used high- and low-column-prefixes,
+we would not be able to detect a change of keyspace from e.g. [2,3) to [1,2),
+and subsequently return invalid results. This race condition can only be
+prevented by re-establishing the prefix-equal-columns for each page.
+
+There is positive news, though: A page split will put a binary copy of the
+page's highkey in the parent page. This means that we usually can reuse
+the compare result of the parent page's downlink's right sibling when we
+discover that their representation is binary equal. In general this will
+be the case, as only after concurrent page splits and deletes may the
+downlink not point to the page with the correct highkey bound
+(_bt_moveright only rarely actually moves right).
+
+To implement this, we copy the downlink's right differentiator key into a
+temporary buffer, which is then compared against the child page's highkey.
+If they match, we reuse the compare result (plus prefix) we had for it from
+the parent page; if not, we need to do a full _bt_compare. Because memcpy +
+memcmp is cheap compared to _bt_compare, and because it's quite unlikely
+that we guess wrong, this speeds up our _bt_moveright code (at the cost of
+some stack memory in _bt_search and some overhead in case of a wrong prediction).
+
+Now that we have prefix bounds on the highest value of a page, the
+_bt_binsrch procedure will use this result as a rightmost prefix compare,
+and for each step in the binary search (that does not compare less than the
+insert key) improve the equal-prefix bounds.
+
+Using the above optimization, we now (on average) only need 2 full key
+compares per page (plus ceil(log2(ntupsperpage)) single-attribute compares),
+as opposed to the ceil(log2(ntupsperpage)) + 1 of a naive implementation;
+a significant improvement.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f4c1a974ef..4c3bdefae2 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -326,6 +326,7 @@ _bt_search_insert(Relation rel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -344,7 +345,8 @@ _bt_search_insert(Relation rel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -438,7 +440,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = _bt_binsrch_insert(rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -448,6 +450,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -483,7 +487,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(_bt_compare(rel, itup_key, page, offset, &cmpcol) < 0);
break;
}
@@ -508,7 +512,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (_bt_compare(rel, itup_key, page, offset, &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -718,11 +722,12 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -865,6 +870,8 @@ _bt_findinsertloc(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Does the new tuple belong on this page?
*
@@ -882,7 +889,7 @@ _bt_findinsertloc(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, insertstate, stack);
@@ -935,6 +942,8 @@ _bt_findinsertloc(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
+
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -965,7 +974,7 @@ _bt_findinsertloc(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -980,10 +989,13 @@ _bt_findinsertloc(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -1002,7 +1014,7 @@ _bt_findinsertloc(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c43c1a2830..e3b828137b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +99,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
/* Get the root page to start with */
*bufP = _bt_getroot(rel, access);
@@ -130,7 +133,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* opportunity to finish splits of internal pages too.
*/
*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot);
+ page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -142,12 +146,15 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = _bt_binsrch(rel, key, *bufP);
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
child = BTreeTupleGetDownLink(itup);
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
/*
* We need to save the location of the pivot tuple we chose in a new
* stack entry for this page/level. If caller ends up splitting a
@@ -181,6 +188,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ highkeycmpcol = 1;
+
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -191,7 +200,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* move right to its new sibling. Do that.
*/
*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
+ snapshot, &highkeycmpcol, (char *) tupdatabuf);
}
return stack_in;
@@ -239,12 +248,16 @@ _bt_moveright(Relation rel,
bool forupdate,
BTStack stack,
int access,
- Snapshot snapshot)
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
+
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -266,12 +279,17 @@ _bt_moveright(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -297,14 +315,55 @@ _bt_moveright(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+ /*
+ * tupdatabuf is filled with the right separator of the parent node.
+ * This allows us to do a binary equality check between the parent
+ * node's right separator (which is < key) and this page's P_HIKEY.
+ * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+ * Without this, we'd on average use 3 full key compares per page before
+ * we achieve full dynamic prefix bounds, but with this optimization
+ * that is only 2.
+ *
+ * 3 compares: 1 for the highkey (rightmost), and on average 2 before
+ * we move right in the binary search on the page; this average equals
+ * SUM (1/2 ^ x) for x from 0 to log(n items), which tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ } else {
+ *comparecol = 1;
+ }
+ } else {
+ *comparecol = 1;
+ }
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
{
+ *comparecol = 1;
/* step right one page */
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -337,7 +396,8 @@ _bt_moveright(Relation rel,
static OffsetNumber
_bt_binsrch(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -345,6 +405,8 @@ _bt_binsrch(Relation rel,
high;
int32 result,
cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -386,16 +448,25 @@ _bt_binsrch(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
+
+ *highkeycmpcol = highcmpcol;
/*
* At this point we have high == low, but be careful: they could point
@@ -439,7 +510,8 @@ _bt_binsrch(Relation rel,
* list split).
*/
OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -449,6 +521,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -499,16 +572,22 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
+
if (result != 0)
stricthigh = high;
}
@@ -656,7 +735,8 @@ int32
_bt_compare(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -696,8 +776,9 @@ _bt_compare(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
{
Datum datum;
bool isNull;
@@ -741,11 +822,20 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = i;
return result;
+ }
scankey++;
}
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
@@ -876,6 +966,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ AttrNumber cmpcol = 1;
Assert(!BTScanPosIsValid(so->currPos));
@@ -1392,7 +1483,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = _bt_binsrch(rel, &inskey, buf, &cmpcol);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8f48960f9d..4cb24fa005 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1232,9 +1232,13 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
int access, Snapshot snapshot);
extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot, AttrNumber *comparecol,
+ char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
--
2.39.0
Attachment: v10-0002-Specialize-nbtree-functions-on-btree-key-shape.patch (application/octet-stream)
From f3bca667ac60de549337353553810610dfca8ace Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:13:04 +0100
Subject: [PATCH v10 2/6] Specialize nbtree functions on btree key shape.
nbtree keys do not all have the same shape, so a significant amount of
time is spent in code that exists only to deal with key shapes other
than the one at hand. By specializing function calls based on the key
shape, we can remove or reduce these sources of overhead.

This commit adds the basic infrastructure for specializing specific hot
code in the nbtree AM to certain shapes of keys, and splits the code
that can benefit from attribute offset optimizations into separate
files. It does NOT yet update the code itself - it only makes the code
compile cleanly. Performance should be comparable, if not identical.
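
As a rough sketch of the technique (a minimal standalone illustration, not
the patch's actual macros; the NBT_SPECIALIZE_FILE / NBTS_FUNCTION /
nbts_prep_ctx machinery below is considerably more involved), a single
template body can be compiled once per key shape, with a dispatcher picking
the generated variant at runtime:

/* spec_sketch.c: compile one template per "key shape", dispatch at runtime */
#include <stdio.h>

#define NBTS_MAKE_NAME(name, suffix) name##_##suffix

/*
 * Template: compare the first natts columns of two int arrays.  NEXTATT
 * abstracts how each key shape fetches the next attribute value.
 */
#define DEFINE_TUPLE_COMPARE(suffix, NEXTATT) \
static int \
NBTS_MAKE_NAME(tuple_compare, suffix)(const int *a, const int *b, int natts) \
{ \
    for (int i = 0; i < natts; i++) \
    { \
        int av = NEXTATT(a, i); \
        int bv = NEXTATT(b, i); \
        if (av != bv) \
            return (av > bv) ? 1 : -1; \
    } \
    return 0; \
}

/* Shape 1: a single key column; the column index can be ignored entirely. */
#define SINGLE_ATT(tup, i)  ((tup)[0])
DEFINE_TUPLE_COMPARE(single, SINGLE_ATT)

/* Shape 2: multiple key columns; plain indexed access. */
#define MULTI_ATT(tup, i)   ((tup)[i])
DEFINE_TUPLE_COMPARE(multi, MULTI_ATT)

/* Dispatcher: pick the specialization once, based on the key shape. */
static int
tuple_compare(const int *a, const int *b, int natts)
{
    if (natts == 1)
        return tuple_compare_single(a, b, natts);
    return tuple_compare_multi(a, b, natts);
}

int
main(void)
{
    int     x[] = {1, 2, 3};
    int     y[] = {1, 2, 4};

    printf("%d %d\n", tuple_compare(x, y, 1), tuple_compare(x, y, 3));
    return 0;
}

The patch applies this at file granularity: the extracted *_spec.c files are
pulled in through NBT_SPECIALIZE_FILE and access/nbtree_spec.h so they can be
compiled per key shape, with nbts_prep_ctx() setting up whatever context the
specialized call sites need.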
---
contrib/amcheck/verify_nbtree.c | 7 +
src/backend/access/nbtree/README | 28 +
src/backend/access/nbtree/nbtdedup.c | 300 +----
src/backend/access/nbtree/nbtdedup_spec.c | 317 +++++
src/backend/access/nbtree/nbtinsert.c | 566 +--------
src/backend/access/nbtree/nbtinsert_spec.c | 583 +++++++++
src/backend/access/nbtree/nbtpage.c | 1 +
src/backend/access/nbtree/nbtree.c | 37 +-
src/backend/access/nbtree/nbtree_spec.c | 69 ++
src/backend/access/nbtree/nbtsearch.c | 1075 +---------------
src/backend/access/nbtree/nbtsearch_spec.c | 1087 +++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 264 +---
src/backend/access/nbtree/nbtsort_spec.c | 280 +++++
src/backend/access/nbtree/nbtsplitloc.c | 3 +
src/backend/access/nbtree/nbtutils.c | 754 +-----------
src/backend/access/nbtree/nbtutils_spec.c | 775 ++++++++++++
src/backend/utils/sort/tuplesortvariants.c | 144 +--
.../utils/sort/tuplesortvariants_spec.c | 158 +++
src/include/access/nbtree.h | 45 +-
src/include/access/nbtree_spec.h | 183 +++
src/include/access/nbtree_specfuncs.h | 66 +
src/tools/pginclude/cpluspluscheck | 2 +
src/tools/pginclude/headerscheck | 2 +
23 files changed, 3613 insertions(+), 3133 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.c
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.c
create mode 100644 src/backend/access/nbtree/nbtree_spec.c
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.c
create mode 100644 src/backend/access/nbtree/nbtsort_spec.c
create mode 100644 src/backend/access/nbtree/nbtutils_spec.c
create mode 100644 src/backend/utils/sort/tuplesortvariants_spec.c
create mode 100644 src/include/access/nbtree_spec.h
create mode 100644 src/include/access/nbtree_specfuncs.h
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 22bb229820..fb89d6ada2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2680,6 +2680,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTStack stack;
Buffer lbuf;
bool exists;
+ nbts_prep_ctx(NULL);
key = _bt_mkscankey(state->rel, itup);
Assert(key->heapkeyspace && key->scantid != NULL);
@@ -2780,6 +2781,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2843,6 +2845,7 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2867,6 +2870,7 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2906,6 +2910,7 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2966,6 +2971,7 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
Page page;
BTPageOpaque opaque;
OffsetNumber maxoffset;
+ nbts_prep_ctx(NULL);
page = palloc(BLCKSZ);
@@ -3141,6 +3147,7 @@ static inline BTScanInsert
bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
+ nbts_prep_ctx(NULL);
skey = _bt_mkscankey(rel, itup);
skey->pivotsearch = true;
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 16a31d2bfe..4b11ea9ad7 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1084,6 +1084,34 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+Notes about nbtree specialization
+---------------------------------
+
+Attribute iteration is a significant overhead for multi-column indexes
+with variable length attributes, due to our inability to cache the offset
+of each attribute into an on-disk tuple. To combat this, we'd have to either
+fully deserialize the tuple, or maintain our offset into the tuple as we
+iterate over the tuple's fields.
+
+Keeping track of this offset has a non-negligible overhead too, so we'd
+prefer to not have to keep track of these offsets when we can use the cache.
+By specializing performance-sensitive search functions for these specific
+index tuple shapes and calling those selectively, we can keep the performance
+of cacheable attribute offsets where that is applicable, while improving
+performance where we currently would see O(n_atts^2) time iterating on
+variable-length attributes. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - multi-column indexes that could benefit from the attcacheoff optimization
+ NB: This is also the default path, and is comparatively slow for uncacheable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 0349988cf5..4589ade267 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,260 +22,14 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple and add it to our temp page (newpage).
- * Else add pending interval's base tuple to the temp page as-is.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place, when this page's newly allocated right sibling page
- * gets its first deduplication pass).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.c"
+#include "access/nbtree_spec.h"
/*
* Perform bottom-up index deletion pass.
@@ -316,6 +70,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
TM_IndexDeleteOp delstate;
bool neverdedup;
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ nbts_prep_ctx(rel);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -752,55 +507,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
diff --git a/src/backend/access/nbtree/nbtdedup_spec.c b/src/backend/access/nbtree/nbtdedup_spec.c
new file mode 100644
index 0000000000..584211fe66
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.c
@@ -0,0 +1,317 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup_spec.c
+ * Index shape-specialized functions for nbtdedup.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtdedup_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_do_singleval NBTS_FUNCTION(_bt_do_singleval)
+
+static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
+ Size newitemsz, bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple and add it to our temp page (newpage).
+ * Else add pending interval's base tuple to the temp page as-is.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place, when this page's newly allocated right sibling page
+ * gets its first deduplication pass).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+_bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4c3bdefae2..ca8ea60ffb 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,17 +30,10 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
@@ -73,313 +66,8 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
- AttrNumber cmpcol = 1;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
- &cmpcol) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
-
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
- NULL);
-}
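+/*
+ * nbtree_spec.h is expected to re-include the file named by
+ * NBT_SPECIALIZE_FILE once per supported key shape, generating a
+ * specialized copy of each function that was moved into nbtinsert_spec.c
+ * (see the "nbtree specialization" section of the nbtree README).
+ */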
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.c"
+#include "access/nbtree_spec.h"
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -423,6 +111,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
bool inposting = false;
bool prevalldead = true;
int curposti = 0;
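+ /* specialization context for the key-shape-specialized calls below */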
+ nbts_prep_ctx(rel);
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -774,253 +463,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- {
- AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
- }
-
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1501,6 +943,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
bool newitemonleft,
isleaf,
isrightmost;
+ nbts_prep_ctx(rel);
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -2693,6 +2136,7 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(buffer);
BTPageOpaque opaque = BTPageGetOpaque(page);
+ nbts_prep_ctx(rel);
Assert(P_ISLEAF(opaque));
Assert(simpleonly || itup_key->heapkeyspace);
diff --git a/src/backend/access/nbtree/nbtinsert_spec.c b/src/backend/access/nbtree/nbtinsert_spec.c
new file mode 100644
index 0000000000..d37afae5ae
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.c
@@ -0,0 +1,583 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtinsert_spec.c
+ * Index shape-specialized functions for nbtinsert.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtinsert_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_search_insert NBTS_FUNCTION(_bt_search_insert)
+#define _bt_findinsertloc NBTS_FUNCTION(_bt_findinsertloc)
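+
+/*
+ * NBTS_FUNCTION() presumably appends the current specialization's suffix to
+ * these names (e.g. _bt_search_insert_default), so that each key shape gets
+ * its own static copy of the functions in this file.
+ */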
+
+static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
+static OffsetNumber _bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+_bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = _bt_mkscankey(rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = _bt_search_insert(rel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+ itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+_bt_search_insert(Relation rel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return _bt_search(rel, insertstate->itup_key, &insertstate->buf, BT_WRITE,
+ NULL);
+}
+
+
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+_bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
+
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 3feee28d19..7710226f41 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1819,6 +1819,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
bool rightsib_empty;
Page page;
BTPageOpaque opaque;
+ nbts_prep_ctx(rel);
/*
* Save original leafbuf block number from caller. Only deleted blocks
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1cc88da032..ceeafd637f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,8 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
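+
+/*
+ * Specialized variants of the functions in nbtree_spec.c (among them
+ * btinsert_default, installed as the initial aminsert callback in bthandler
+ * below) are presumably generated by this include.
+ */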
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.c"
+#include "access/nbtree_spec.h"
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -120,7 +122,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambuild = btbuild;
amroutine->ambuildempty = btbuildempty;
- amroutine->aminsert = btinsert;
+ amroutine->aminsert = btinsert_default;
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
@@ -152,6 +154,8 @@ btbuildempty(Relation index)
{
Page metapage;
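+ /* make sure this index's key-shape specialization is set up (cf. _bt_specialize) */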
+ nbt_opt_specialize(index);
+
/* Construct metapage. */
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
@@ -177,33 +181,6 @@ btbuildempty(Relation index)
smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
@@ -345,6 +322,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -788,6 +767,8 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(rel);
+
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
new file mode 100644
index 0000000000..6b766581ab
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.c
+ * Index shape-specialized functions for nbtree.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtree_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
+void
+_bt_specialize(Relation rel)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ NBTS_MAKE_CTX(rel);
+
+ /*
+ * We can't name _bt_specialize directly here, because the name would be
+ * macro-expanded; nor can we use NBTS_SPECIALIZE_NAME, because that would
+ * call back into _bt_specialize and recurse infinitely.
+ */
+ switch (__nbts_ctx)
+ {
+ case NBTS_CTX_CACHED:
+ _bt_specialize_cached(rel);
+ break;
+ case NBTS_CTX_DEFAULT:
+ break;
+ }
+#else
+ rel->rd_indam->aminsert = btinsert;
+#endif
+}
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+ bool result;
+ IndexTuple itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e3b828137b..0089fe7eeb 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,12 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
- AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -47,6 +43,8 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
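+
+/*
+ * _bt_search, _bt_moveright and _bt_binsrch are expected to be compiled
+ * once per supported key shape from nbtsearch_spec.c, via the
+ * nbtree_spec.h include below.
+ */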
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.c"
+#include "access/nbtree_spec.h"
/*
* _bt_drop_lock_and_maybe_pin()
@@ -71,572 +69,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- */
-BTStack
-_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
- Snapshot snapshot)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
- char tupdatabuf[BLCKSZ / 3];
- AttrNumber highkeycmpcol = 1;
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
- page_access, snapshot, &highkeycmpcol,
- (char *) tupdatabuf);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
- memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- highkeycmpcol = 1;
-
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot, &highkeycmpcol, (char *) tupdatabuf);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- */
-Buffer
-_bt_moveright(Relation rel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- Snapshot snapshot,
- AttrNumber *comparecol,
- char *tupdatabuf)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- page = BufferGetPage(buf);
- TestForOldSnapshot(snapshot, rel, page);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- {
- *comparecol = 1;
- break;
- }
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- /*
- * tupdatabuf is filled with the right seperator of the parent node.
- * This allows us to do a binary equality check between the parent
- * node's right seperator (which is < key) and this page's P_HIKEY.
- * If they equal, we can reuse the result of the parent node's
- * rightkey compare, which means we can potentially save a full key
- * compare (which includes indirect calls to attribute comparison
- * functions).
- *
- * Without this, we'd on average use 3 full key compares per page before
- * we achieve full dynamic prefix bounds, but with this optimization
- * that is only 2.
- *
- * 3 compares: 1 for the highkey (rightmost), and on average 2 before
- * we move right in the binary search on the page, this average equals
- * SUM (1/2 ^ x) for x from 0 to log(n items)), which tends to 2.
- */
- if (!P_IGNORE(opaque) && *comparecol > 1)
- {
- IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
- IndexTuple buftuple = (IndexTuple) tupdatabuf;
- if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
- {
- char *dataptr = (char *) itup;
-
- if (memcmp(dataptr + sizeof(IndexTupleData),
- tupdatabuf + sizeof(IndexTupleData),
- IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
- break;
- } else {
- *comparecol = 1;
- }
- } else {
- *comparecol = 1;
- }
-
- if (P_IGNORE(opaque) ||
- _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
- {
- *comparecol = 1;
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- {
- *comparecol = cmpcol;
- break;
- }
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf,
- AttrNumber *highkeycmpcol)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
- AttrNumber highcmpcol = *highkeycmpcol,
- lowcmpcol = 1;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
- }
- }
-
- *highkeycmpcol = highcmpcol;
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
- AttrNumber lowcmpcol = 1;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
-
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
-
/*----------
* _bt_binsrch_posting() -- posting list binary search.
*
@@ -704,228 +136,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum,
- AttrNumber *comparecol)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
-
- scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- {
- *comparecol = i;
- return result;
- }
-
- scankey++;
- }
-
- /*
- * All tuple attributes are equal to the scan key, only later attributes
- * could potentially not equal the scan key.
- */
- *comparecol = ntupatts + 1;
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
-
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -967,6 +177,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
BTScanPosItem *currItem;
BlockNumber blkno;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(rel);
Assert(!BTScanPosIsValid(so->currPos));
@@ -1589,280 +800,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2071,12 +1008,11 @@ static bool
_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Relation rel;
+ Relation rel = scan->indexRelation;
Page page;
BTPageOpaque opaque;
bool status;
-
- rel = scan->indexRelation;
+ nbts_prep_ctx(rel);
if (ScanDirectionIsForward(dir))
{
@@ -2488,6 +1424,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
BTPageOpaque opaque;
OffsetNumber start;
BTScanPosItem *currItem;
+ nbts_prep_ctx(rel);
/*
* Scan down to the leftmost or rightmost leaf page. This is a simplified
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
new file mode 100644
index 0000000000..37cc3647d3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -0,0 +1,1087 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsearch_spec.c
+ * Index shape-specialized functions for nbtsearch.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsearch_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_binsrch NBTS_FUNCTION(_bt_binsrch)
+#define _bt_readpage NBTS_FUNCTION(_bt_readpage)
+
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
+static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ */
+BTStack
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+ Snapshot snapshot)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
+ page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is lowest non-leaf page level prior to leaves. So, if
+ * we're on the level 1 and asked to lock leaf page in write mode,
+ * then lock next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ highkeycmpcol = 1;
+
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+ snapshot, &highkeycmpcol, (char *) tupdatabuf);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split
+ * is completed. 'stack' is only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ */
+Buffer
+_bt_moveright(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
+ break;
+ }
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ /*
+ * tupdatabuf is filled with the right separator of the parent node.
+ * This allows us to do a binary equality check between the parent
+ * node's right separator (which is < key) and this page's P_HIKEY.
+ * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+ * Without this, we'd use on average 3 full key compares per page
+ * before we establish full dynamic prefix bounds; with this
+ * optimization that is only 2.
+ *
+ * 3 compares: 1 for the high key (rightmost), plus on average 2
+ * before we move right in the binary search on the page. That
+ * average equals SUM(1/2^x) for x from 0 to log(n items), which
+ * tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ }
+ else
+ *comparecol = 1;
+ }
+ else
+ *comparecol = 1;
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
+ {
+ *comparecol = 1;
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ {
+ *comparecol = cmpcol;
+ break;
+ }
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
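
To show the reuse test in isolation: a minimal, hypothetical sketch (hikey and
parent_sep are stand-ins for serialized index tuples, not the nbtree API) of
the byte-for-byte check that lets a child page inherit the parent's
comparison result instead of doing another full key compare:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Sketch: a page's high key can only "inherit" the parent's comparison
     * result when it is byte-for-byte identical to the separator key that
     * was already compared in the parent.  hikey/parent_sep are stand-ins
     * for serialized index tuples.
     */
    static bool
    can_reuse_parent_compare(const char *hikey, size_t hikey_len,
                             const char *parent_sep, size_t sep_len)
    {
        return hikey_len == sep_len &&
            memcmp(hikey, parent_sep, hikey_len) == 0;
    }

    int
    main(void)
    {
        const char sep[] = {'f', 'o', 'o', 0, 7};
        const char hikey[] = {'f', 'o', 'o', 0, 7};

        if (can_reuse_parent_compare(hikey, sizeof(hikey), sep, sizeof(sep)))
            printf("high key matches parent separator: reuse compare column\n");
        return 0;
    }
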
+
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
+ */
+static OffsetNumber
+_bt_binsrch(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+ }
+ }
+
+ *highkeycmpcol = highcmpcol;
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
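
The compare-column bookkeeping above is easiest to see on a toy example.
Here's a minimal, self-contained sketch (plain int arrays and invented names,
not the nbtree data structures) of a binary search that starts each row
comparison at the smallest prefix known to be equal at both bounds:

    #include <stdio.h>

    #define NCOLS 3
    #define NROWS 6

    /*
     * Compare key against row, starting at 1-based column *cmpcol because
     * all earlier columns are already known equal.  Reports the first
     * unequal column (or NCOLS + 1 on full equality) back through *cmpcol.
     */
    static int
    row_compare(const int *key, const int *row, int *cmpcol)
    {
        for (int col = *cmpcol; col <= NCOLS; col++)
        {
            if (key[col - 1] != row[col - 1])
            {
                *cmpcol = col;
                return key[col - 1] < row[col - 1] ? -1 : 1;
            }
        }
        *cmpcol = NCOLS + 1;
        return 0;
    }

    int
    main(void)
    {
        int rows[NROWS][NCOLS] = {
            {1, 1, 1}, {1, 1, 5}, {1, 2, 0}, {1, 2, 7}, {1, 3, 3}, {2, 0, 0}
        };
        int key[NCOLS] = {1, 2, 7};
        int low = 0;
        int high = NROWS;       /* search the half-open range [low, high) */
        int lowcmpcol = 1;
        int highcmpcol = 1;

        while (high > low)
        {
            int mid = low + (high - low) / 2;
            int cmpcol = lowcmpcol < highcmpcol ? lowcmpcol : highcmpcol;
            int result = row_compare(key, rows[mid], &cmpcol);

            if (result >= 1)    /* key > row: keep searching to the right */
            {
                low = mid + 1;
                lowcmpcol = cmpcol;
            }
            else                /* key <= row: tighten the upper bound */
            {
                high = mid;
                highcmpcol = cmpcol;
            }
        }
        printf("first row >= key is at slot %d\n", low);   /* prints 3 */
        return 0;
    }
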
+
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+ AttrNumber lowcmpcol = 1;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
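
A small hypothetical sketch of the cached-bounds reuse described above
(illustrative struct and names, not BTInsertStateData): a second search
against the same, unmodified page restarts from the cached [low, stricthigh)
window instead of searching the whole page again.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct CachedBounds
    {
        int     low;            /* first candidate slot */
        int     stricthigh;     /* first slot known strictly > key */
        bool    bounds_valid;
    } CachedBounds;

    /* Pick the window for the next binary search on the same page. */
    static void
    search_window(const CachedBounds *b, int firstoff, int maxoff,
                  int *low, int *high)
    {
        if (b->bounds_valid)
        {
            *low = b->low;
            *high = b->stricthigh;
        }
        else
        {
            *low = firstoff;
            *high = maxoff + 1; /* establish the loop invariant for high */
        }
    }

    int
    main(void)
    {
        CachedBounds b = {.low = 40, .stricthigh = 45, .bounds_valid = true};
        int low, high;

        search_window(&b, 1, 200, &low, &high);
        printf("restart search in [%d, %d)\n", low, high);  /* [40, 45) */
        return 0;
    }
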
+
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+_bt_compare(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ {
+ *comparecol = i;
+ return result;
+ }
+
+ scankey++;
+ }
+
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
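
The NULL-ordering branches above boil down to a small three-way rule. A
standalone sketch with invented names (not the ScanKey machinery), assuming
integer keys, that mirrors it:

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Hypothetical three-way comparison with explicit NULL ordering: NULLs
     * sort as regular values, either before or after all non-NULLs.
     */
    static int
    nullable_cmp(bool key_isnull, int key, bool item_isnull, int item,
                 bool nulls_first)
    {
        if (key_isnull && item_isnull)
            return 0;                       /* NULL "=" NULL */
        if (key_isnull)
            return nulls_first ? -1 : 1;    /* NULL vs NOT NULL */
        if (item_isnull)
            return nulls_first ? 1 : -1;    /* NOT NULL vs NULL */
        return (key > item) - (key < item);
    }

    int
    main(void)
    {
        /* With NULLS FIRST, a NULL key sorts before any non-NULL item. */
        printf("%d\n", nullable_cmp(true, 0, false, 42, true));    /* -1 */
        /* With NULLS LAST, the same NULL key sorts after it. */
        printf("%d\n", nullable_cmp(true, 0, false, 42, false));   /* 1 */
        return 0;
    }
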
+
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* allow next page be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
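
To illustrate the backwards-scan posting list layout described above, a tiny
sketch with plain ints standing in for heap TIDs (invented names, not the
real _bt_savepostingitem state): decrementing the item index while walking
the posting list in ascending TID order still hands a backward scan the TIDs
in ascending order.

    #include <stdio.h>

    #define MAX_ITEMS 16

    int
    main(void)
    {
        int tids[] = {101, 102, 103};   /* one posting list, ascending TIDs */
        int ntids = 3;
        int items[MAX_ITEMS];
        int itemIndex = MAX_ITEMS;

        /* mirror the backward-scan pattern: decrement, then store next TID */
        for (int i = 0; i < ntids; i++)
        {
            itemIndex--;
            items[itemIndex] = tids[i];
        }

        /* a backward scan consumes items[] from the top down */
        for (int i = MAX_ITEMS - 1; i >= itemIndex; i--)
            printf("%d\n", items[i]);   /* prints 101, 102, 103 */
        return 0;
    }
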
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 67b7b1710c..af408f704f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,8 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.c"
+#include "access/nbtree_spec.h"
/*
* btbuild() -- build a new btree index.
@@ -544,6 +544,7 @@ static void
_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
{
BTWriteState wstate;
+ nbts_prep_ctx(btspool->index);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
@@ -844,6 +845,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
Size pgspc;
Size itupsz;
bool isleaf;
+ nbts_prep_ctx(wstate->index);
/*
* This is a handy place to check for cancel interrupts during the btree
@@ -1176,264 +1178,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- Assert(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
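
A note on the mechanism the nbtsort.c hunks above rely on: nbtree_spec.h is not part of this excerpt, but the intent is that it pulls the file named by NBT_SPECIALIZE_FILE back in once per supported key shape, pasting a shape suffix onto every NBTS_FUNCTION() name, while nbts_prep_ctx() sets up whatever the call sites need to pick the right variant at run time. The names and structure below are assumptions for illustration only, not the patch's actual header; they just show the usual C "one compiled copy per variant, plus run-time dispatch" pattern:

#include <stdio.h>

/*
 * Illustration only: a guess at the general pattern, not the patch's
 * actual nbtree_spec.h.  One function body is instantiated once per key
 * shape, with the shape name pasted into the symbol, and callers pick a
 * variant at run time based on the index.
 */
#define DEFINE_SUM_KEYS(suffix, GETATTR) \
	static int sum_keys_##suffix(const int *tup, int natts) \
	{ \
		int sum = 0; \
		for (int i = 0; i < natts; i++) \
			sum += GETATTR(tup, i); \
		return sum; \
	}

#define GETATTR_CACHED(tup, i)   ((tup)[(i)])		/* think: attcacheoff path */
#define GETATTR_UNCACHED(tup, i) (*((tup) + (i)))	/* think: iterator path */

DEFINE_SUM_KEYS(cached, GETATTR_CACHED)
DEFINE_SUM_KEYS(uncached, GETATTR_UNCACHED)

/* run-time dispatch, loosely what an nbts_prep_ctx()-style setup enables */
static int
sum_keys(const int *tup, int natts, int shape_is_cached)
{
	return shape_is_cached ? sum_keys_cached(tup, natts)
						   : sum_keys_uncached(tup, natts);
}

int
main(void)
{
	int		tup[] = {1, 2, 3};

	printf("%d %d\n", sum_keys(tup, 3, 1), sum_keys(tup, 3, 0));
	return 0;
}

The real header no doubt differs in the details; the point is just that each shape gets its own compiled copy of the hot attribute-access loops, with a cheap dispatch in front.
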
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
new file mode 100644
index 0000000000..368d6f244c
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsort_spec.c
+ * Index shape-specialized functions for nbtsort.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsort_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_load NBTS_FUNCTION(_bt_load)
+
+static void _bt_load(BTWriteState *wstate,
+ BTSpool *btspool, BTSpool *btspool2);
+
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ Assert(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
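
As a quick sanity check on the maxpostingsize comment in the _bt_load() copy above: with the stock 8 kB block size on a typical 64-bit build (MAXALIGN of 8 and 4-byte line pointers are assumed values here, not taken from the patch), the limit works out to 812 bytes per posting list tuple, i.e. roughly a tenth of the page:

#include <stdio.h>
#include <stdint.h>

/* assumed constants for a stock 64-bit build; adjust if your build differs */
#define BLCKSZ            8192
#define MAXIMUM_ALIGNOF   8
#define SIZEOF_ITEMIDDATA 4

#define MAXALIGN_DOWN(LEN) \
	((uintptr_t) (LEN) & ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

int
main(void)
{
	uintptr_t	maxpostingsize =
		MAXALIGN_DOWN(BLCKSZ * 10 / 100) - SIZEOF_ITEMIDDATA;

	/* 819 rounded down to 816, minus one line pointer: 812 bytes */
	printf("maxpostingsize = %zu bytes\n", (size_t) maxpostingsize);
	return 0;
}
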
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index ecb49bb471..991118fd50 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -639,6 +639,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
ItemId itemid;
IndexTuple tup;
int keepnatts;
+ nbts_prep_ctx(state->rel);
Assert(state->is_leaf && !state->is_rightmost);
@@ -945,6 +946,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
*rightinterval;
int perfectpenalty;
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+ nbts_prep_ctx(state->rel);
/* Assume that alternative strategy won't be used for now */
*strategy = SPLIT_DEFAULT;
@@ -1137,6 +1139,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
{
IndexTuple lastleft;
IndexTuple firstright;
+ nbts_prep_ctx(state->rel);
if (!state->is_leaf)
{
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 8003583c0a..85f92adda8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,10 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.c"
+#include "access/nbtree_spec.h"
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1220,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2173,286 +1703,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
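
One more aside on the nbtutils.c code removed above: _bt_keep_natts_fast() depends on the one-way implication that bitwise-equal datums are always equal under the opclass, while the reverse need not hold (numeric 1.0 vs 1.00, or case-insensitive text, are the classic counterexamples). A toy standalone analogy, not PostgreSQL code, showing why the bitwise fast path can only err toward keeping an extra attribute:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* "opclass" equality: case-insensitive compare (logically equal values) */
static int
logically_equal(const char *a, const char *b)
{
	for (; *a && *b; a++, b++)
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return 0;
	return *a == *b;
}

/* "fast path" equality: bitwise compare, like datum_image_eq() for text */
static int
bitwise_equal(const char *a, const char *b)
{
	return strlen(a) == strlen(b) && memcmp(a, b, strlen(a)) == 0;
}

int
main(void)
{
	const char *lastleft = "ABC";
	const char *firstright = "abc";

	/*
	 * Equal per the comparator, unequal bitwise: the fast path would keep
	 * this attribute, a harmless false negative for split-point choice.
	 */
	printf("logical=%d bitwise=%d\n",
		   logically_equal(lastleft, firstright),
		   bitwise_equal(lastleft, firstright));
	return 0;
}

A false negative here only nudges nbtsplitloc.c toward a more balanced split point, which is exactly the weaker guarantee the removed comment describes.
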
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
new file mode 100644
index 0000000000..0288da22d6
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -0,0 +1,775 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtutils_spec.c
+ * Index shape-specialized functions for nbtutils.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtutils_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_check_rowcompare NBTS_FUNCTION(_bt_check_rowcompare)
+#define _bt_keep_natts NBTS_FUNCTION(_bt_keep_natts)
+
+static bool _bt_check_rowcompare(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+ continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+ TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb6cfcfd00..12f909e1cf 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -57,8 +57,6 @@ static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
SortTuple *stup);
static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
LogicalTape *tape, unsigned int tuplen);
-static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
@@ -130,6 +128,9 @@ typedef struct
int datumTypeLen;
} TuplesortDatumArg;
+#define NBT_SPECIALIZE_FILE "../../backend/utils/sort/tuplesortvariants_spec.c"
+#include "access/nbtree_spec.h"
+
Tuplesortstate *
tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
@@ -217,6 +218,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
MemoryContext oldcontext;
TuplesortClusterArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
@@ -328,6 +330,7 @@ tuplesort_begin_index_btree(Relation heapRel,
TuplesortIndexBTreeArg *arg;
MemoryContext oldcontext;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -461,6 +464,7 @@ tuplesort_begin_index_gist(Relation heapRel,
MemoryContext oldcontext;
TuplesortIndexBTreeArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -1259,142 +1263,6 @@ removeabbrev_index(Tuplesortstate *state, SortTuple *stups, int count)
}
}
-static int
-comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- /*
- * This is similar to comparetup_heap(), but expects index tuples. There
- * is also special handling for enforcing uniqueness, and special
- * treatment for equal keys at the end.
- */
- TuplesortPublic *base = TuplesortstateGetPublic(state);
- TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
- SortSupport sortKey = base->sortKeys;
- IndexTuple tuple1;
- IndexTuple tuple2;
- int keysz;
- TupleDesc tupDes;
- bool equal_hasnull = false;
- int nkey;
- int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
-
- /* Compare the leading sort key */
- compare = ApplySortComparator(a->datum1, a->isnull1,
- b->datum1, b->isnull1,
- sortKey);
- if (compare != 0)
- return compare;
-
- /* Compare additional sort keys */
- tuple1 = (IndexTuple) a->tuple;
- tuple2 = (IndexTuple) b->tuple;
- keysz = base->nKeys;
- tupDes = RelationGetDescr(arg->index.indexRel);
-
- if (sortKey->abbrev_converter)
- {
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
- }
-
- /* they are equal, so we only need to examine one null flag */
- if (a->isnull1)
- equal_hasnull = true;
-
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
- {
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
-
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare; /* done when we find unequal attributes */
-
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
- equal_hasnull = true;
- }
-
- /*
- * If btree has asked us to enforce uniqueness, complain if two equal
- * tuples are detected (unless there was at least one NULL field and NULLS
- * NOT DISTINCT was not set).
- *
- * It is sufficient to make the test here, because if two tuples are equal
- * they *must* get compared at some stage of the sort --- otherwise the
- * sort algorithm wouldn't have checked whether one must appear before the
- * other.
- */
- if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- /*
- * Some rather brain-dead implementations of qsort (such as the one in
- * QNX 4) will sometimes call the comparison routine to compare a
- * value to itself, but we always use our own implementation, which
- * does not.
- */
- Assert(tuple1 != tuple2);
-
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is required for
- * btree indexes, since heap TID is treated as an implicit last key
- * attribute in order to ensure that all keys in the index are physically
- * unique.
- */
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
- /* ItemPointer values should never be equal */
- Assert(false);
-
- return 0;
-}
-
static int
comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state)
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
new file mode 100644
index 0000000000..0791f41136
--- /dev/null
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -0,0 +1,158 @@
+/*-------------------------------------------------------------------------
+ *
+ * tuplesortvariants_spec.c
+ * Index shape-specialized functions for tuplesortvariants.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/utils/sort/tuplesortvariants_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define comparetup_index_btree NBTS_FUNCTION(comparetup_index_btree)
+
+static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+
+static int
+comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ /*
+ * This is similar to comparetup_heap(), but expects index tuples. There
+ * is also special handling for enforcing uniqueness, and special
+ * treatment for equal keys at the end.
+ */
+ TuplesortPublic *base = TuplesortstateGetPublic(state);
+ TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
+ SortSupport sortKey = base->sortKeys;
+ IndexTuple tuple1;
+ IndexTuple tuple2;
+ int keysz;
+ TupleDesc tupDes;
+ bool equal_hasnull = false;
+ int nkey;
+ int32 compare;
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+
+ /* Compare the leading sort key */
+ compare = ApplySortComparator(a->datum1, a->isnull1,
+ b->datum1, b->isnull1,
+ sortKey);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ tuple1 = (IndexTuple) a->tuple;
+ tuple2 = (IndexTuple) b->tuple;
+ keysz = base->nKeys;
+ tupDes = RelationGetDescr(arg->index.indexRel);
+
+ if (sortKey->abbrev_converter)
+ {
+ datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
+
+ compare = ApplySortAbbrevFullComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare;
+ }
+
+ /* they are equal, so we only need to examine one null flag */
+ if (a->isnull1)
+ equal_hasnull = true;
+
+ sortKey++;
+ for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ {
+ datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+
+ compare = ApplySortComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare; /* done when we find unequal attributes */
+
+ /* they are equal, so we only need to examine one null flag */
+ if (isnull1)
+ equal_hasnull = true;
+ }
+
+ /*
+ * If btree has asked us to enforce uniqueness, complain if two equal
+ * tuples are detected (unless there was at least one NULL field and NULLS
+ * NOT DISTINCT was not set).
+ *
+ * It is sufficient to make the test here, because if two tuples are equal
+ * they *must* get compared at some stage of the sort --- otherwise the
+ * sort algorithm wouldn't have checked whether one must appear before the
+ * other.
+ */
+ if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ /*
+ * Some rather brain-dead implementations of qsort (such as the one in
+ * QNX 4) will sometimes call the comparison routine to compare a
+ * value to itself, but we always use our own implementation, which
+ * does not.
+ */
+ Assert(tuple1 != tuple2);
+
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is required for
+ * btree indexes, since heap TID is treated as an implicit last key
+ * attribute in order to ensure that all keys in the index are physically
+ * unique.
+ */
+ {
+ BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
+ BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
+
+ if (blk1 != blk2)
+ return (blk1 < blk2) ? -1 : 1;
+ }
+ {
+ OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
+ OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
+
+ if (pos1 != pos2)
+ return (pos1 < pos2) ? -1 : 1;
+ }
+
+ /* ItemPointer values should never be equal */
+ Assert(false);
+
+ return 0;
+}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4cb24fa005..f3f0961052 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1122,15 +1122,27 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+typedef enum NBTS_CTX {
+ NBTS_CTX_CACHED,
+ NBTS_CTX_DEFAULT, /* fallback */
+} NBTS_CTX;
+
+static inline NBTS_CTX _nbt_spec_context(Relation irel)
+{
+ if (!PointerIsValid(irel))
+ return NBTS_CTX_DEFAULT;
+
+ return NBTS_CTX_CACHED;
+}
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
+#include "nbtree_spec.h"
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1161,9 +1173,6 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
- IndexTuple newitem, Size newitemsz,
- bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1179,9 +1188,6 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
@@ -1229,16 +1235,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
- int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
- bool forupdate, BTStack stack, int access,
- Snapshot snapshot, AttrNumber *comparecol,
- char *tupdatabuf);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
- OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -1247,7 +1243,6 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1255,8 +1250,6 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1269,10 +1262,6 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
new file mode 100644
index 0000000000..fa38b09c6e
--- /dev/null
+++ b/src/include/access/nbtree_spec.h
@@ -0,0 +1,183 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.h
+ * header file for postgres btree access method implementation.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_spec.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ * - nbts_prep_ctx(irel)
+ * Constructs a context that is used to call specialized functions.
+ * Note that this is unneeded in paths that are inaccessible to unspecialized
+ * code paths (i.e. code included through nbtree_spec.h), because that
+ * always calls the optimized functions directly.
+ */
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+#define NBTS_CTX_NAME __nbts_ctx
+
+/* contextual specializations */
+#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
+#define NBTS_SPECIALIZE_NAME(name) ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
+)
+
+/* how do we make names? */
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
+
+#define nbt_opt_specialize(rel) \
+do { \
+ Assert(PointerIsValid(rel)); \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
+} while (false)
+
+/*
+ * Protections against multiple inclusions - the definition of this macro is
+ * different for files included with the templating mechanism vs the users
+ * of this template, so redefine these macros at top and bottom.
+ */
+#ifdef NBTS_FUNCTION
+#undef NBTS_FUNCTION
+#endif
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+/* While specializing, the context is the local context */
+#ifdef nbts_prep_ctx
+#undef nbts_prep_ctx
+#endif
+#define nbts_prep_ctx(rel)
+
+/*
+ * Specialization 1: CACHED
+ *
+ * Multiple key columns, optimized access for attcacheoff -cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) do {} while (false)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_SPECIALIZING_CACHED
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * Specialization 2: DEFAULT
+ *
+ * "Default", externally accessible, not so optimized functions
+ */
+
+/* Only the default context may need to specialize in some cases, so here's that */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * All next uses of nbts_prep_ctx are in non-templated code, so here we make
+ * sure we actually create the context.
+ */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
+
+/*
+ * From here on, all NBTS_FUNCTION uses refer to specialized functions that
+ * are being called. Change the result of that macro from a direct call to
+ * a conditional call to the right specialization, depending on the
+ * context.
+ */
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) NBTS_SPECIALIZE_NAME(name)
+
+#undef NBT_SPECIALIZE_FILE
diff --git a/src/include/access/nbtree_specfuncs.h b/src/include/access/nbtree_specfuncs.h
new file mode 100644
index 0000000000..ac60319eff
--- /dev/null
+++ b/src/include/access/nbtree_specfuncs.h
@@ -0,0 +1,66 @@
+/*
+ * prototypes for functions that are included in nbtree.h
+ */
+
+#define _bt_specialize NBTS_FUNCTION(_bt_specialize)
+#define btinsert NBTS_FUNCTION(btinsert)
+#define _bt_dedup_pass NBTS_FUNCTION(_bt_dedup_pass)
+#define _bt_doinsert NBTS_FUNCTION(_bt_doinsert)
+#define _bt_search NBTS_FUNCTION(_bt_search)
+#define _bt_moveright NBTS_FUNCTION(_bt_moveright)
+#define _bt_binsrch_insert NBTS_FUNCTION(_bt_binsrch_insert)
+#define _bt_compare NBTS_FUNCTION(_bt_compare)
+#define _bt_mkscankey NBTS_FUNCTION(_bt_mkscankey)
+#define _bt_checkkeys NBTS_FUNCTION(_bt_checkkeys)
+#define _bt_truncate NBTS_FUNCTION(_bt_truncate)
+#define _bt_keep_natts_fast NBTS_FUNCTION(_bt_keep_natts_fast)
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void _bt_specialize(Relation rel);
+
+extern bool btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool _bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+ int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+ bool forupdate, BTStack stack, int access,
+ Snapshot snapshot, AttrNumber *comparecol,
+ char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
+extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..0b9997ef5d 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -116,6 +116,8 @@ do
test "$f" = src/pl/tcl/pltclerrcodes.h && continue
# Also not meant to be included standalone.
+ test "$f" = src/include/access/nbtree_spec.h && continue
+ test "$f" = src/include/access/nbtree_specfuncs.h && continue
test "$f" = src/include/common/unicode_nonspacing_table.h && continue
test "$f" = src/include/common/unicode_east_asian_fw_table.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..0350df58b8 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -111,6 +111,8 @@ do
test "$f" = src/pl/tcl/pltclerrcodes.h && continue
# Also not meant to be included standalone.
+ test "$f" = src/include/access/nbtree_spec.h && continue
+ test "$f" = src/include/access/nbtree_specfuncs.h && continue
test "$f" = src/include/common/unicode_nonspacing_table.h && continue
test "$f" = src/include/common/unicode_east_asian_fw_table.h && continue
--
2.39.0
v10-0006-btree-specialization-for-variable-length-multi-a.patchapplication/octet-stream; name=v10-0006-btree-specialization-for-variable-length-multi-a.patchDownload
From 26f95a437cc6c6aa48e76bd4593306ae89823ea4 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 13 Jan 2023 15:42:41 +0100
Subject: [PATCH v10 6/6] btree specialization for variable-length
multi-attribute keys
The default code path is relatively slow at O(n^2), so with multiple
attributes we accept the increased startup cost in favour of lower
costs for later attributes.
Note that this will only be used for indexes that use at least one
variable-length key attribute (except as last key attribute in specific
cases).
---
src/backend/access/nbtree/README | 6 +-
src/backend/access/nbtree/nbtree_spec.c | 3 +
src/include/access/itup_attiter.h | 199 ++++++++++++++++++++++++
src/include/access/nbtree.h | 11 +-
src/include/access/nbtree_spec.h | 48 +++++-
5 files changed, 258 insertions(+), 9 deletions(-)
create mode 100644 src/include/access/itup_attiter.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6864902637..1b9bfe6d89 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1105,14 +1105,12 @@ performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
- indexes with only a single key attribute
+ - multi-column indexes that cannot pre-calculate the offsets of all key
+ attributes in the tuple data section
- multi-column indexes that could benefit from the attcacheoff optimization
NB: This is also the default path, and is comparatively slow for uncachable
attribute offsets.
-Future work will optimize for multi-column indexes that don't benefit
-from the attcacheoff optimization by improving on the O(n^2) nature of
-index_getattr through storing attribute offsets.
-
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 21635397ed..699197dfa7 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_UNCACHED:
+ _bt_specialize_uncached(rel);
+ break;
case NBTS_CTX_SINGLE_KEYATT:
_bt_specialize_single_keyatt(rel);
break;
diff --git a/src/include/access/itup_attiter.h b/src/include/access/itup_attiter.h
new file mode 100644
index 0000000000..c8fb6954bc
--- /dev/null
+++ b/src/include/access/itup_attiter.h
@@ -0,0 +1,199 @@
+/*-------------------------------------------------------------------------
+ *
+ * itup_attiter.h
+ * POSTGRES index tuple attribute iterator definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/itup_attiter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ITUP_ATTITER_H
+#define ITUP_ATTITER_H
+
+#include "access/itup.h"
+#include "varatt.h"
+
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupdesc - the tuple description
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* Offset of attribute 1 is always 0 */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ false, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false);
+
+/*
+ * Initiate an index attribute iterator at attribute attnum.
+ *
+ * This is nearly the same as index_deform_tuple, except that it stores
+ * the internal state up to attnum in *iter, instead of populating the
+ * datum- and isnull-arrays.
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+/* ----------------
+ * index_attiternext() - get the next attribute of an index tuple
+ *
+ * This gets called many times, so we do the least amount of work
+ * possible.
+ *
+ * The code does not attempt to update attcacheoff, as it is unlikely
+ * to reach a situation where the cached offset matters a lot.
+ * If the cached offsets do matter, the caller should make sure that
+ * PopulateTupleDescCacheOffsets() was called on the tuple descriptor
+ * to populate the attribute offset cache.
+ *
+ * ----------------
+ */
+static inline Datum
+index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true;
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
+#endif /* ITUP_ATTITER_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4628c41e9a..d5ed38bb71 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -16,6 +16,7 @@
#include "access/amapi.h"
#include "access/itup.h"
+#include "access/itup_attiter.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/xlogreader.h"
@@ -1124,18 +1125,26 @@ typedef struct BTOptions
typedef enum NBTS_CTX {
NBTS_CTX_SINGLE_KEYATT,
+ NBTS_CTX_UNCACHED,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
static inline NBTS_CTX _nbt_spec_context(Relation irel)
{
+ AttrNumber nKeyAtts;
+
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
- if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ nKeyAtts = IndexRelationGetNumberOfKeyAttributes(irel);
+
+ if (nKeyAtts == 1)
return NBTS_CTX_SINGLE_KEYATT;
+ if (TupleDescAttr(irel->rd_att, nKeyAtts - 1)->attcacheoff < -1)
+ return NBTS_CTX_UNCACHED;
+
return NBTS_CTX_CACHED;
}
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index 8e476c300d..efed9824e7 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -45,6 +45,7 @@
* Macros used in the nbtree specialization code.
*/
#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -53,8 +54,10 @@
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
(NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_UNCACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_UNCACHED)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
) \
)
@@ -69,8 +72,11 @@ do { \
Assert(PointerIsValid(rel)); \
if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
{ \
- nbts_prep_ctx(rel); \
- _bt_specialize(rel); \
+ PopulateTupleDescCacheOffsets((rel)->rd_att); \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
} \
} while (false)
@@ -216,6 +222,40 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but attcacheoff -optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit((itup), (initAttNum), (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* All next uses of nbts_prep_ctx are in non-templated code, so here we make
* sure we actually create the context.
--
2.39.0
Hm. The cfbot has a fairly trivial issue with this, an unused variable:
[18:36:17.405] In file included from ../../src/include/access/nbtree.h:1184,
[18:36:17.405] from verify_nbtree.c:27:
[18:36:17.405] verify_nbtree.c: In function ‘palloc_btree_page’:
[18:36:17.405] ../../src/include/access/nbtree_spec.h:51:23: error:
unused variable ‘__nbts_ctx’ [-Werror=unused-variable]
[18:36:17.405] 51 | #define NBTS_CTX_NAME __nbts_ctx
[18:36:17.405] | ^~~~~~~~~~
[18:36:17.405] ../../src/include/access/nbtree_spec.h:54:43: note: in
expansion of macro ‘NBTS_CTX_NAME’
[18:36:17.405] 54 | #define NBTS_MAKE_CTX(rel) const NBTS_CTX
NBTS_CTX_NAME = _nbt_spec_context(rel)
[18:36:17.405] | ^~~~~~~~~~~~~
[18:36:17.405] ../../src/include/access/nbtree_spec.h:264:28: note: in
expansion of macro ‘NBTS_MAKE_CTX’
[18:36:17.405] 264 | #define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
[18:36:17.405] | ^~~~~~~~~~~~~
[18:36:17.405] verify_nbtree.c:2974:2: note: in expansion of macro
‘nbts_prep_ctx’
[18:36:17.405] 2974 | nbts_prep_ctx(NULL);
[18:36:17.405] | ^~~~~~~~~~~~~
On Tue, 4 Apr 2023 at 17:43, Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote:
Hm. The cfbot has a fairly trivial issue with this, an unused variable:
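For anyone wondering where that warning comes from: nbts_prep_ctx() always
declares the __nbts_ctx variable, and that variable is only ever read when a
later NBTS_FUNCTION() / NBTS_SPECIALIZE_NAME() use dispatches on it, so a
bare nbts_prep_ctx(NULL) with no specialized call following it trips
-Wunused-variable. Below is a minimal standalone sketch of the same pattern
(illustrative names only, not the patch's code) that reproduces the warning
outside of PostgreSQL:

#include <stdio.h>

typedef enum { CTX_CACHED, CTX_DEFAULT } spec_ctx;

/* stand-in for NBTS_MAKE_CTX / nbts_prep_ctx */
#define prep_ctx(rel) const spec_ctx __ctx = ((rel) != NULL ? CTX_CACHED : CTX_DEFAULT)
/* stand-in for NBTS_SPECIALIZE_NAME: the only place __ctx is ever read */
#define SPEC_CALL(fn) ((__ctx == CTX_CACHED) ? fn##_cached() : fn##_default())

static int do_work_cached(void)  { return 1; }
static int do_work_default(void) { return 2; }

static int
uses_ctx(void *rel)
{
    prep_ctx(rel);
    return SPEC_CALL(do_work);  /* __ctx is read here: no warning */
}

static int
never_dispatches(void *rel)
{
    prep_ctx(rel);              /* __ctx is never read: -Wunused-variable */
    return do_work_default();
}

int
main(void)
{
    int dummy;

    printf("%d %d\n", uses_ctx(&dummy), never_dispatches(NULL));
    return 0;
}

Either dropping the stray nbts_prep_ctx(NULL) call or marking the variable
pg_attribute_unused() should presumably silence it.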
Attached is a rebase on top of HEAD @ 8cca660b to make the patch apply
again. I think this is ready for review once again; the patchset has seen
no significant changes since v8, posted this January [0].
[0]: /messages/by-id/CAEze2WixWviBYTWXiFLbD3AuLT4oqGk_MykS_ssB=TudeZ=ajQ@mail.gmail.com
Kind regards,
Matthias van de Meent
Neon, Inc.
= Description of the patchset so far:
This patchset implements two features that, *taken together*, improve
the performance of our btree implementation:
== Dynamic prefix truncation (0001)
The code now tracks how many prefix attributes of the scan key are
already known to be equal based on earlier binary search results, and
skips those prefix columns in further binary search comparisons (the
page is a sorted list: if both the low and high bounds of your range
share a prefix, every value between them shares that prefix, too).
This reduces the number of calls into opclass-supplied (dynamic)
compare functions, and thus increases performance for
multi-key-attribute indexes where shared prefixes are common (e.g. an
index on (customer, order_id)).
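To illustrate the invariant being relied on, here's a small self-contained
sketch over toy two-column integer keys (made-up names; the patch threads
this through _bt_moveright/_bt_binsrch_insert/_bt_compare on real index
tuples - note the new comparecol/highcmpcol parameters - rather than
using a helper like this):

typedef struct TwoColKey { int a; int b; } TwoColKey;

/* compare tuple vs. key starting at startcol; report leading equal columns */
static int
cmp_from(const TwoColKey *t, const TwoColKey *key, int startcol, int *eqcols)
{
    const int tvals[2] = { t->a, t->b };
    const int kvals[2] = { key->a, key->b };

    for (int col = startcol; col < 2; col++)
    {
        int cmp = (tvals[col] > kvals[col]) - (tvals[col] < kvals[col]);

        if (cmp != 0)
        {
            *eqcols = col;      /* columns [0, col) compared equal */
            return cmp;
        }
    }
    *eqcols = 2;
    return 0;
}

/* lower-bound binary search that skips columns already known to be equal */
static int
binsrch_prefix_skip(const TwoColKey *page, int nitems, TwoColKey key)
{
    int low = 0, high = nitems;
    int low_eq = 0;             /* key columns known equal at the low bound */
    int high_eq = 0;            /* ... and at the high bound */

    while (low < high)
    {
        int mid = low + (high - low) / 2;
        int skip = (low_eq < high_eq) ? low_eq : high_eq;
        int eqcols;
        int cmp = cmp_from(&page[mid], &key, skip, &eqcols);

        if (cmp < 0)
        {
            low = mid + 1;      /* page[mid] becomes the new low bound */
            low_eq = eqcols;
        }
        else
        {
            high = mid;         /* page[mid] becomes the new high bound */
            high_eq = eqcols;
        }
    }
    return low;                 /* first index with page[i] >= key */
}

The column comparisons skipped here correspond to the opclass support
function calls the patch saves; the effect grows with the number of shared
leading columns.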
== Index key shape code specialization (0002-0006)
Index tuple attribute accesses for multi-column indexes are often done
through index_getattr, which gets very expensive for indexes that
cannot use attcacheoff. However, single-key and attcacheoff-able
indexes benefit greatly from the attcacheoff optimization, so we can't
just stop applying it. This is why the second part of the patchset
(0002 and up) adds infrastructure to generate specialized code paths
that access key attributes in the most efficient way available: single
key attributes don't go through loops/conditionals to determine which
attribute is being accessed (with certain exceptions, 0004),
attcacheoff-able indexes get the same treatment as they do now, and
indexes where attcacheoff cannot be used for all key attributes get a
special attribute iterator that incrementally calculates the offset of
each attribute in the current index tuple (0005+0006).
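To make that mechanism more concrete, here is a minimal self-contained
sketch of the idea (hypothetical names; the patch instead instantiates the
shared function bodies by repeatedly #including them through nbtree_spec.h
with different nbts_attiter* macro definitions, and dispatches through
NBTS_SPECIALIZE_NAME() or the relation's aminsert pointer):

#include <stdio.h>

/* one "templated" body, instantiated once per key shape via GET_ATTR */
#define MAKE_SUM_FUNC(SUFFIX, GET_ATTR)                 \
static int                                              \
sum_keys_##SUFFIX(const int *tuple, int natts)          \
{                                                       \
    int total = 0;                                      \
                                                        \
    for (int i = 0; i < natts; i++)                     \
        total += GET_ATTR(tuple, i);                    \
    return total;                                       \
}

/* "cached offsets": the attribute can be fetched directly */
#define GET_ATTR_CACHED(tup, i) ((tup)[(i)])

/* "uncached": mimic index_getattr having to walk to the attribute */
static int
walk_to_attr(const int *tup, int i)
{
    int v = 0;

    for (int j = 0; j <= i; j++)
        v = tup[j];
    return v;
}
#define GET_ATTR_UNCACHED(tup, i) walk_to_attr((tup), (i))

MAKE_SUM_FUNC(cached, GET_ATTR_CACHED)
MAKE_SUM_FUNC(uncached, GET_ATTR_UNCACHED)

int
main(void)
{
    int tuple[3] = {1, 2, 3};
    int offsets_cacheable = 1;  /* decided once per relation, not per access */
    int (*sum_keys) (const int *, int) =
        offsets_cacheable ? sum_keys_cached : sum_keys_uncached;

    printf("%d\n", sum_keys(tuple, 3));
    return 0;
}

The point is that the per-attribute access strategy is baked into each
instantiation at compile time, while the choice between instantiations
happens only once per relation.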
Although patch 0002 is large, most of the modified lines are functions
being moved into different files. Once 0002 is understood, the
following patches should be fairly easy to understand as well.
= Why both features in one patchset?
The overhead of tracking the prefix in 0001 can cost several percent
of performance (measured 5-9%) for the common index shapes where
dynamic prefix truncation cannot be applied; single-column unique
indexes are the most sensitive to this. By adding the
single-key-column code specialization in 0004, we reduce other kinds
of overhead for exactly those indexes, which compensates for the
additional overhead of 0001 and gives a net-neutral result overall.
= Benchmarks
I haven't re-run the benchmarks since v8 at [0], as I haven't modified
the patches significantly since that version - only compiler complaint
fixes and changes required for rebasing. The results from that
benchmark: improvements vary between 'not significantly different from
HEAD' and '250+% improved throughput for selected INSERT workloads,
and 360+% improved throughput for selected REINDEX workloads'. Graphs
from that benchmark are now attached as well, since LibreOffice Calc
wasn't able to export the sheet with working graphs.
[0]: /messages/by-id/CAEze2WixWviBYTWXiFLbD3AuLT4oqGk_MykS_ssB=TudeZ=ajQ@mail.gmail.com
Attachments:
v11-0006-btree-specialization-for-variable-length-multi-a.patchapplication/octet-stream; name=v11-0006-btree-specialization-for-variable-length-multi-a.patchDownload
From 8dfbc98b0ea4ff624fde3739156d8a4c3f12ca33 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 13 Jan 2023 15:42:41 +0100
Subject: [PATCH v11 6/6] btree specialization for variable-length
multi-attribute keys
The default code path is relatively slow at O(n^2), so with multiple
attributes we accept the increased startup cost in favour of lower
costs for later attributes.
Note that this will only be used for indexes that use at least one
variable-length key attribute (except as last key attribute in specific
cases).
---
src/backend/access/nbtree/README | 6 +-
src/backend/access/nbtree/nbtree_spec.c | 3 +
src/include/access/itup_attiter.h | 199 ++++++++++++++++++++++++
src/include/access/nbtree.h | 11 +-
src/include/access/nbtree_spec.h | 48 +++++-
5 files changed, 258 insertions(+), 9 deletions(-)
create mode 100644 src/include/access/itup_attiter.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index e90e24cb70..0c45288e61 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1105,14 +1105,12 @@ performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
- indexes with only a single key attribute
+ - multi-column indexes that cannot pre-calculate the offsets of all key
+ attributes in the tuple data section
- multi-column indexes that could benefit from the attcacheoff optimization
NB: This is also the default path, and is comparatively slow for uncachable
attribute offsets.
-Future work will optimize for multi-column indexes that don't benefit
-from the attcacheoff optimization by improving on the O(n^2) nature of
-index_getattr through storing attribute offsets.
-
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 21635397ed..699197dfa7 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_UNCACHED:
+ _bt_specialize_uncached(rel);
+ break;
case NBTS_CTX_SINGLE_KEYATT:
_bt_specialize_single_keyatt(rel);
break;
diff --git a/src/include/access/itup_attiter.h b/src/include/access/itup_attiter.h
new file mode 100644
index 0000000000..c8fb6954bc
--- /dev/null
+++ b/src/include/access/itup_attiter.h
@@ -0,0 +1,199 @@
+/*-------------------------------------------------------------------------
+ *
+ * itup_attiter.h
+ * POSTGRES index tuple attribute iterator definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/itup_attiter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ITUP_ATTITER_H
+#define ITUP_ATTITER_H
+
+#include "access/itup.h"
+#include "varatt.h"
+
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupdesc - the tuple description
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* Offset of attribute 1 is always 0 */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ false, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false);
+
+/*
+ * Initiate an index attribute iterator at attribute attnum.
+ *
+ * This is nearly the same as index_deform_tuple, except that it stores
+ * the internal state up to attnum in *iter, instead of populating the
+ * datum- and isnull-arrays.
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+/* ----------------
+ * index_attiternext() - get the next attribute of an index tuple
+ *
+ * This gets called many times, so we do the least amount of work
+ * possible.
+ *
+ * The code does not attempt to update attcacheoff, as it is unlikely
+ * to reach a situation where the cached offset matters a lot.
+ * If the cached offsets do matter, the caller should make sure that
+ * PopulateTupleDescCacheOffsets() was called on the tuple descriptor
+ * to populate the attribute offset cache.
+ *
+ * ----------------
+ */
+static inline Datum
+index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true;
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
+#endif /* ITUP_ATTITER_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 72fbf3a4c6..204e349872 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -16,6 +16,7 @@
#include "access/amapi.h"
#include "access/itup.h"
+#include "access/itup_attiter.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/xlogreader.h"
@@ -1123,18 +1124,26 @@ typedef struct BTOptions
typedef enum NBTS_CTX {
NBTS_CTX_SINGLE_KEYATT,
+ NBTS_CTX_UNCACHED,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
static inline NBTS_CTX _nbt_spec_context(Relation irel)
{
+ AttrNumber nKeyAtts;
+
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
- if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ nKeyAtts = IndexRelationGetNumberOfKeyAttributes(irel);
+
+ if (nKeyAtts == 1)
return NBTS_CTX_SINGLE_KEYATT;
+ if (TupleDescAttr(irel->rd_att, nKeyAtts - 1)->attcacheoff < -1)
+ return NBTS_CTX_UNCACHED;
+
return NBTS_CTX_CACHED;
}
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index 8e476c300d..efed9824e7 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -45,6 +45,7 @@
* Macros used in the nbtree specialization code.
*/
#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -53,8 +54,10 @@
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
(NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_UNCACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_UNCACHED)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
) \
)
@@ -69,8 +72,11 @@ do { \
Assert(PointerIsValid(rel)); \
if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
{ \
- nbts_prep_ctx(rel); \
- _bt_specialize(rel); \
+ PopulateTupleDescCacheOffsets((rel)->rd_att); \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
} \
} while (false)
@@ -216,6 +222,40 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but attcacheoff -optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit((itup), (initAttNum), (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* All next uses of nbts_prep_ctx are in non-templated code, so here we make
* sure we actually create the context.
--
2.39.0
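To make the intended use of the new iterator API concrete, here is a rough
caller-side sketch (not part of the patch; itup, itupdesc and keysz stand
for whatever index tuple, tuple descriptor and key column count the caller
has at hand):

    IAttrIterStateData iter;

    /* position the iterator so the first "next" call returns attribute 1 */
    index_attiterinit(itup, 1, itupdesc, &iter);

    for (AttrNumber attnum = 1; attnum <= keysz; attnum++)
    {
        /* returns the datum and updates iter.offset / iter.isNull / iter.slow */
        Datum   datum = index_attiternext(itup, attnum, itupdesc, &iter);

        if (iter.isNull)
            continue;           /* handle a NULL key column */

        /* ... use datum, e.g. pass it to the column's comparison proc ... */
    }

This is roughly the pattern the nbts_attiter* macros expand to in the
UNCACHED specialization.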
v11-0003-Use-specialized-attribute-iterators-in-the-speci.patchapplication/octet-stream; name=v11-0003-Use-specialized-attribute-iterators-in-the-speci.patchDownload
From 7ee3514e21d0ec16be2e87c93847be661b913c32 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:57:21 +0100
Subject: [PATCH v11 3/6] Use specialized attribute iterators in the
specialized source files
This is committed separately to make clear what substantial changes were
made to the pre-existing code.
Even though not all nbt*_spec functions have been updated, these functions
can now directly call (and inline, and optimize for) the specialized functions
they call, instead of having to determine the right specialization based on
the (potentially locally unavailable) index relation. That makes duplicating
those functions per specialization still worthwhile.
---
src/backend/access/nbtree/nbtsearch_spec.c | 18 +++---
src/backend/access/nbtree/nbtsort_spec.c | 24 +++----
src/backend/access/nbtree/nbtutils_spec.c | 62 ++++++++++++-------
.../utils/sort/tuplesortvariants_spec.c | 54 +++++++++-------
4 files changed, 92 insertions(+), 66 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
index 1e04caf090..f0bdcb9828 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.c
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -641,6 +641,7 @@ _bt_compare(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -673,23 +674,26 @@ _bt_compare(Relation rel,
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, *comparecol, itupdesc);
+
+ nbts_foreachattr(*comparecol, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -718,7 +722,7 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
{
- *comparecol = i;
+ *comparecol = nbts_attiter_attnum;
return result;
}
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
index 368d6f244c..6f33cc4cc2 100644
--- a/src/backend/access/nbtree/nbtsort_spec.c
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -34,8 +34,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -57,7 +56,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -90,21 +89,24 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else if (itup != NULL)
{
int32 compare = 0;
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
index 0288da22d6..07ca18f404 100644
--- a/src/backend/access/nbtree/nbtutils_spec.c
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -64,7 +64,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
- int i;
+ nbts_attiterdeclare(itup);
itupdesc = RelationGetDescr(rel);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -95,7 +95,10 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -106,27 +109,30 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -675,6 +681,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
@@ -686,20 +694,22 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -707,6 +717,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -747,24 +758,27 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
index 0791f41136..61c4826853 100644
--- a/src/backend/utils/sort/tuplesortvariants_spec.c
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -40,11 +40,8 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
bool equal_hasnull = false;
int nkey;
int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
+ nbts_attiterdeclare(tuple1);
+ nbts_attiterdeclare(tuple2);
/* Compare the leading sort key */
compare = ApplySortComparator(a->datum1, a->isnull1,
@@ -59,37 +56,46 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
keysz = base->nKeys;
tupDes = RelationGetDescr(arg->index.indexRel);
- if (sortKey->abbrev_converter)
+ if (!sortKey->abbrev_converter)
{
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
+ nkey = 2;
+ sortKey++;
+ }
+ else
+ {
+ nkey = 1;
}
/* they are equal, so we only need to examine one null flag */
if (a->isnull1)
equal_hasnull = true;
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ nbts_attiterinit(tuple1, nkey, tupDes);
+ nbts_attiterinit(tuple2, nkey, tupDes);
+
+ nbts_foreachattr(nkey, keysz)
{
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+ Datum datum1,
+ datum2;
+ datum1 = nbts_attiter_nextattdatum(tuple1, tupDes);
+ datum2 = nbts_attiter_nextattdatum(tuple2, tupDes);
+
+ if (nbts_attiter_attnum == 1)
+ compare = ApplySortAbbrevFullComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ else
+ compare = ApplySortComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
if (compare != 0)
- return compare; /* done when we find unequal attributes */
+ return compare;
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
+ if (nbts_attiter_curattisnull(tuple1))
equal_hasnull = true;
+
+ sortKey++;
}
/*
--
2.39.0
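For readers skimming the diff above: the attribute-iteration protocol that
the converted loops now follow boils down to this schematic pattern (a
sketch of the macro usage, not literal code from any one function):

    nbts_attiterdeclare(itup);              /* per-tuple iterator state */

    nbts_attiterinit(itup, 1, itupdesc);    /* start iteration at attribute 1 */

    nbts_foreachattr(1, keysz)
    {
        Datum   datum = nbts_attiter_nextattdatum(itup, itupdesc);
        bool    isnull = nbts_attiter_curattisnull(itup);

        /* per-attribute work; nbts_attiter_attnum is the current column */
    }

Each specialization then supplies its own definitions for these macros, so
the same loop body compiles down to attcacheoff-based access, iterator-based
access, or the single-attribute fast path.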
Attachment: v11-0004-Optimize-nbts_attiter-for-nkeyatts-1-btrees.patch (application/octet-stream)
From 3119760693d53acd871c8ead2ad6722b604ff388 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 20:04:56 +0100
Subject: [PATCH v11 4/6] Optimize nbts_attiter for nkeyatts==1 btrees
This removes the index_getattr_nocache call path, which has significant overhead, and instead uses a constant offset of 0.
---
src/backend/access/nbtree/README | 1 +
src/backend/access/nbtree/nbtree_spec.c | 3 ++
src/include/access/nbtree.h | 35 ++++++++++++++++
src/include/access/nbtree_spec.h | 56 ++++++++++++++++++++++++-
4 files changed, 93 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index e9d0cf6ac1..e90e24cb70 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1104,6 +1104,7 @@ in the index AM to call the specialized functions, increasing the
performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
+ - indexes with only a single key attribute
- multi-column indexes that could benefit from the attcacheoff optimization
NB: This is also the default path, and is comparatively slow for uncachable
attribute offsets.
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 6b766581ab..21635397ed 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_SINGLE_KEYATT:
+ _bt_specialize_single_keyatt(rel);
+ break;
case NBTS_CTX_DEFAULT:
break;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d1bbc4d2a8..72fbf3a4c6 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1122,6 +1122,7 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
typedef enum NBTS_CTX {
+ NBTS_CTX_SINGLE_KEYATT,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
@@ -1131,9 +1132,43 @@ static inline NBTS_CTX _nbt_spec_context(Relation irel)
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
+ if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ return NBTS_CTX_SINGLE_KEYATT;
+
return NBTS_CTX_CACHED;
}
+static inline Datum _bt_getfirstatt(IndexTuple tuple, TupleDesc tupleDesc,
+ bool *isNull)
+{
+ Datum result;
+ if (IndexTupleHasNulls(tuple))
+ {
+ if (att_isnull(0, (bits8 *)(tuple) + sizeof(IndexTupleData)))
+ {
+ *isNull = true;
+ result = (Datum) 0;
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)
+ + sizeof(IndexAttributeBitMapData)));
+ }
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)));
+ }
+
+ return result;
+}
+
#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
#include "nbtree_spec.h"
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index fa38b09c6e..8e476c300d 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -44,6 +44,7 @@
/*
* Macros used in the nbtree specialization code.
*/
+#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -51,8 +52,10 @@
/* contextual specializations */
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
)
@@ -164,6 +167,55 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Specialization 3: SINGLE_KEYATT
+ *
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path cannot be used for indexes with multiple key
+ * columns, because it never considers the next column.
+ */
+
+/* the default context (and later contexts) do need to specialize, so here's that */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel)
+
+#define NBTS_SPECIALIZING_SINGLE_KEYATT
+#define NBTS_TYPE NBTS_TYPE_SINGLE_KEYATT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert((endAttNum) == 1); ((void) (endAttNum)); \
+ if ((initAttNum) == 1) for (int spec_i = 0; spec_i < 1; spec_i++)
+
+#define nbts_attiter_attnum 1
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+( \
+ AssertMacro(spec_i == 0), \
+ _bt_getfirstatt(itup, tupDesc, &NBTS_MAKE_NAME(itup, isNull)) \
+)
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_KEYATT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* All next uses of nbts_prep_ctx are in non-templated code, so here we make
* sure we actually create the context.
--
2.39.0
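In the single-key-attribute specialization the iteration macros effectively
collapse into one call of the new helper; roughly, per tuple (a sketch,
assuming rel and itup are in scope):

    bool    isnull;
    Datum   datum = _bt_getfirstatt(itup, RelationGetDescr(rel), &isnull);

    if (!isnull)
    {
        /* compare datum against the single scan key here */
    }

so there is no iterator state and no offset bookkeeping left on this path.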
Attachment: v11-0005-Add-an-attcacheoff-populating-function.patch (application/octet-stream)
From b9891e737f142391e45e85f18993617eccae5a54 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 12 Jan 2023 21:34:36 +0100
Subject: [PATCH v11 5/6] Add an attcacheoff-populating function
It populates attcacheoff-capable attributes with the correct offset,
and fills attributes whose offset is uncacheable with an 'uncacheable'
indicator value, as opposed to -1, which signals "unknown".
This allows users of the API to remove redundant cycles that try to
cache the offset of attributes - instead of O(N-attrs) operations, this
one only requires an O(1) check.
---
src/backend/access/common/tupdesc.c | 111 ++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 113 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 7c5c390503..b3f543cd83 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -927,3 +927,114 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc, returning the attribute
+ * number of the last attribute that has a valid cached offset.
+ *
+ * Populates attcacheoff with a negative cache value when no offset
+ * can be calculated (due to e.g. variable length attributes).
+ * The negative value is relative to the last cacheable attribute:
+ * attcacheoff = -1 - (thisattno - cachedattno)
+ * so that the last attribute with cached offset can be found with
+ * cachedattno = attcacheoff + 1 + thisattno
+ *
+ * The value returned is the AttrNumber of the last (1-based) attribute that
+ * had its offset cached.
+ *
+ * When the TupleDesc has 0 attributes, it returns 0.
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber currAttNo, lastCachedAttNo;
+
+ if (numberOfAttributes == 0)
+ return 0;
+
+ /* Non-negative value: this attribute is cached */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff >= 0)
+ return (AttrNumber) desc->natts;
+ /*
+ * Attribute has been filled with relative offset to last cached value, but
+ * it itself is unreachable.
+ */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ return (AttrNumber) (TupleDescAttr(desc, desc->natts - 1)->attcacheoff + 1 + desc->natts);
+
+ /* last attribute of the tupledesc may or may not support attcacheoff */
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ currAttNo = 1;
+ /*
+ * Other code may have populated the value previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff >= 0)
+ currAttNo++;
+
+ /*
+ * Cache offset is undetermined. Start calculating offsets if possible.
+ *
+ * When we exit this block, currAttNo will point at the first uncacheable
+ * attribute, or past the end of the attribute array.
+ */
+ if (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, currAttNo - 1);
+ int32 off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, currAttNo);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ {
+ att->attcacheoff = off;
+ currAttNo++;
+ }
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ currAttNo++;
+ }
+ }
+ }
+
+ Assert(currAttNo == numberOfAttributes || (
+ currAttNo < numberOfAttributes
+ && TupleDescAttr(desc, (currAttNo - 1))->attcacheoff >= 0
+ && TupleDescAttr(desc, currAttNo)->attcacheoff == -1
+ ));
+ /*
+ * No cacheable offsets left. Fill the rest with negative cache values,
+ * but return the latest cached offset.
+ */
+ lastCachedAttNo = currAttNo;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ TupleDescAttr(desc, currAttNo)->attcacheoff = -1 - (currAttNo - lastCachedAttNo);
+ currAttNo++;
+ }
+
+ return lastCachedAttNo;
+}
\ No newline at end of file
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index b4286cf922..2673f2d0f3 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.39.0
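To make the encoding concrete (my own worked example of the formula in the
function comment, not text from the patch): if the last attribute with a
cacheable offset is attribute 3, then attribute 6 is filled with

    attcacheoff = -1 - (6 - 3) = -4

and the last cached attribute can be recovered in O(1) via

    cachedattno = attcacheoff + 1 + thisattno = -4 + 1 + 6 = 3

The first attribute past the cacheable prefix therefore ends up below -1,
which is exactly what the attcacheoff < -1 test in _nbt_spec_context relies
on to select the uncached code path.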
Attachment: v11-0001-Implement-dynamic-prefix-compression-in-nbtree.patch (application/octet-stream)
From 6f6248dfb37b6bd6275d1b520e712d2bf83228d9 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 10 Jan 2023 21:45:44 +0100
Subject: [PATCH v11 1/6] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, if the tuples on both sides
of the tuple being compared are equal to the scankey on some prefix
of the scan attributes, then the tuple being compared must also be
equal to the scankey on that prefix of attributes.
We cannot generally propagate this information to _binsrch on
lower pages, as this downstream page may have concurrently split
and/or have merged with its deleted left neighbour (see [0]),
which moves the keyspace of the linked page. We thus can only trust
the current state of the page for this optimization, which means we
must validate this state each time we open the page.
Although this limits the overall applicability of the optimization,
it still provides a nice performance improvement in most cases where
initial columns have many duplicate values and a compare function
that is not cheap.
As an exception to the above rule, most of the time a page's
highkey is equal to the right separator on the parent page, due to
how btree splits are done. By storing this right separator from
the parent page and then validating that the highkey of the child
page contains the exact same data, we can restore the right prefix
bound without having to call the relatively expensive _bt_compare.
In the worst-case scenario of a concurrent page split, we'd still
have to validate the full key, but that doesn't happen very often
when compared to the number of times we descend the btree.
---
contrib/amcheck/verify_nbtree.c | 17 ++--
src/backend/access/nbtree/README | 43 ++++++++++
src/backend/access/nbtree/nbtinsert.c | 34 +++++---
src/backend/access/nbtree/nbtsearch.c | 118 +++++++++++++++++++++++---
src/include/access/nbtree.h | 9 +-
5 files changed, 187 insertions(+), 34 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 94a9759322..e57625b75c 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2701,6 +2701,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2710,13 +2711,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2778,6 +2779,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2788,7 +2790,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2840,10 +2842,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2863,10 +2866,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2901,13 +2905,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..0f10141a2f 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,49 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because NBTrees have a sorted keyspace, when we have determined that some
+prefixing columns of tuples on both sides of the tuple that is being
+compared are equal to the scankey, then the current tuple must also share
+this prefix with the scankey. This allows us to skip comparing those columns,
+saving the indirect function calls in the compare operation.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so this is only useful on the page level: Concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by a parent page to the right. If we re-used high- and low-column-prefixes,
+we would not be able to detect a change of keyspace from e.g. [2,3) to [1,2),
+and subsequently return invalid results. This race condition can only be
+prevented by re-establishing the prefix-equal-columns for each page.
+
+There is positive news, though: A page split will put a binary copy of the
+page's highkey in the parent page. This means that we usually can reuse
+the compare result of the parent page's downlink's right sibling when we
+discover that their representation is binary equal. In general this will
+be the case, as only concurrent page splits and deletions can cause the
+downlink to not point to the page with the correct highkey bound
+(_bt_moveright only rarely actually moves right).
+
+To implement this, we copy the downlink's right differentiator key into a
+temporary buffer, which is then compared against the child page's highkey.
+If they match, we reuse the compare result (plus prefix) we had for it from
+the parent page; if not, we need to do a full _bt_compare. Because memcpy +
+memcmp is cheap compared to _bt_compare, and because it's quite unlikely
+that we guess wrong, this speeds up our _bt_moveright code (at the cost of
+some stack memory in _bt_search and some overhead in case of a wrong
+prediction).
+
+Now that we have prefix bounds on the highest value of a page, the
+_bt_binsrch procedure will use this result as a rightmost prefix compare,
+and for each step in the binary search (that does not compare less than the
+insert key) improve the equal-prefix bounds.
+
+Using the above optimization, we now (on average) only need 2 full key
+compares per page (plus ceil(log2(ntupsperpage)) single-attribute compares),
+as opposed to the ceil(log2(ntupsperpage)) + 1 of a naive implementation;
+a significant improvement.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d33f814a93..39e7e9b731 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -328,6 +328,7 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -346,7 +347,8 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -440,7 +442,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = _bt_binsrch_insert(rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -450,6 +452,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -485,7 +489,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(_bt_compare(rel, itup_key, page, offset, &cmpcol) < 0);
break;
}
@@ -510,7 +514,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (_bt_compare(rel, itup_key, page, offset, &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -720,11 +724,12 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -867,6 +872,8 @@ _bt_findinsertloc(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Does the new tuple belong on this page?
*
@@ -884,7 +891,7 @@ _bt_findinsertloc(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, heapRel, insertstate, stack);
@@ -937,6 +944,8 @@ _bt_findinsertloc(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
+
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -967,7 +976,7 @@ _bt_findinsertloc(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -982,10 +991,13 @@ _bt_findinsertloc(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -1004,7 +1016,7 @@ _bt_findinsertloc(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7e05e58676..7423b76e1c 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -101,6 +102,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
/* heaprel must be set whenever _bt_allocbuf is reachable */
Assert(access == BT_READ || access == BT_WRITE);
@@ -137,7 +140,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* opportunity to finish splits of internal pages too.
*/
*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
- stack_in, page_access, snapshot);
+ stack_in, page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -149,12 +153,15 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = _bt_binsrch(rel, key, *bufP);
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
child = BTreeTupleGetDownLink(itup);
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
/*
* We need to save the location of the pivot tuple we chose in a new
* stack entry for this page/level. If caller ends up splitting a
@@ -188,6 +195,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ highkeycmpcol = 1;
+
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -198,7 +207,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* move right to its new sibling. Do that.
*/
*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
+ snapshot, &highkeycmpcol, (char *) tupdatabuf);
}
return stack_in;
@@ -247,13 +256,16 @@ _bt_moveright(Relation rel,
bool forupdate,
BTStack stack,
int access,
- Snapshot snapshot)
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
Assert(!forupdate || heaprel != NULL);
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
/*
* When nextkey = false (normal case): if the scan key that brought us to
@@ -276,12 +288,17 @@ _bt_moveright(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -307,14 +324,55 @@ _bt_moveright(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+ /*
+ * tupdatabuf is filled with the right separator of the parent node.
+ * This allows us to do a binary equality check between the parent
+ * node's right separator (which is < key) and this page's P_HIKEY.
+ * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+ * Without this, we'd on average use 3 full key compares per page before
+ * we achieve full dynamic prefix bounds, but with this optimization
+ * that is only 2.
+ *
+ * 3 compares: 1 for the highkey (rightmost), and on average 2 before
+ * we move right in the binary search on the page; this average equals
+ * SUM(1/2^x) for x from 0 to log2(n items), which tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
{
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ } else {
+ *comparecol = 1;
+ }
+ } else {
+ *comparecol = 1;
+ }
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
+ {
+ *comparecol = 1;
/* step right one page */
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -347,7 +405,8 @@ _bt_moveright(Relation rel,
static OffsetNumber
_bt_binsrch(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -355,6 +414,8 @@ _bt_binsrch(Relation rel,
high;
int32 result,
cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -396,16 +457,25 @@ _bt_binsrch(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
+
+ *highkeycmpcol = highcmpcol;
/*
* At this point we have high == low, but be careful: they could point
@@ -449,7 +519,8 @@ _bt_binsrch(Relation rel,
* list split).
*/
OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -459,6 +530,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -509,16 +581,22 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
+
if (result != 0)
stricthigh = high;
}
@@ -666,7 +744,8 @@ int32
_bt_compare(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -706,8 +785,9 @@ _bt_compare(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
{
Datum datum;
bool isNull;
@@ -751,11 +831,20 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = i;
return result;
+ }
scankey++;
}
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
@@ -886,6 +975,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ AttrNumber cmpcol = 1;
Assert(!BTScanPosIsValid(so->currPos));
@@ -1402,7 +1492,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = _bt_binsrch(rel, &inskey, buf, &cmpcol);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8891fa7973..11f4184107 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1234,9 +1234,12 @@ extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
Buffer *bufP, int access, Snapshot snapshot);
extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
- int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+ int access, Snapshot snapshot,
+ AttrNumber *comparecol, char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
--
2.39.0
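To spell out the invariant with a small example (my own illustration, not
text from the patch): suppose that during the binary search on a page the
tuple at the current low bound is known to be equal to the scan key on its
first two key columns, and the tuple at the current high bound on its first
three. Every tuple between those bounds must then agree with the scan key on
at least the first min(2, 3) = 2 columns, so the _bt_compare call for the
midpoint can start at column 3, and whichever bound the midpoint replaces
inherits the column number that _bt_compare reports back through its new
comparecol argument.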
Attachment: v11-0002-Specialize-nbtree-functions-on-btree-key-shape.patch (application/octet-stream)
From cce61ad3617f6966316cc743948ac0cb4ef8693a Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:13:04 +0100
Subject: [PATCH v11 2/6] Specialize nbtree functions on btree key shape.
nbtree keys are not all made the same, so a significant amount of time is
spent on code that exists only to deal with other key shapes. By specializing
function calls based on the key shape, we can remove or reduce these causes
of overhead.
This commit adds the basic infrastructure for specializing specific hot code
in the nbtree AM to certain shapes of keys, and splits the code that can
benefit from attribute offset optimizations into separate files. This does
NOT yet update the code itself - it just makes the code compile cleanly.
The performance should be comparable if not the same.
---
contrib/amcheck/verify_nbtree.c | 6 +
src/backend/access/nbtree/README | 28 +
src/backend/access/nbtree/nbtdedup.c | 300 +----
src/backend/access/nbtree/nbtdedup_spec.c | 317 +++++
src/backend/access/nbtree/nbtinsert.c | 579 +--------
src/backend/access/nbtree/nbtinsert_spec.c | 584 +++++++++
src/backend/access/nbtree/nbtpage.c | 1 +
src/backend/access/nbtree/nbtree.c | 37 +-
src/backend/access/nbtree/nbtree_spec.c | 69 ++
src/backend/access/nbtree/nbtsearch.c | 1084 +---------------
src/backend/access/nbtree/nbtsearch_spec.c | 1096 +++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 264 +---
src/backend/access/nbtree/nbtsort_spec.c | 280 +++++
src/backend/access/nbtree/nbtsplitloc.c | 3 +
src/backend/access/nbtree/nbtutils.c | 754 +-----------
src/backend/access/nbtree/nbtutils_spec.c | 775 ++++++++++++
src/backend/utils/sort/tuplesortvariants.c | 144 +--
.../utils/sort/tuplesortvariants_spec.c | 158 +++
src/include/access/nbtree.h | 44 +-
src/include/access/nbtree_spec.h | 183 +++
src/include/access/nbtree_specfuncs.h | 65 +
src/tools/pginclude/cpluspluscheck | 2 +
src/tools/pginclude/headerscheck | 2 +
23 files changed, 3625 insertions(+), 3150 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.c
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.c
create mode 100644 src/backend/access/nbtree/nbtree_spec.c
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.c
create mode 100644 src/backend/access/nbtree/nbtsort_spec.c
create mode 100644 src/backend/access/nbtree/nbtutils_spec.c
create mode 100644 src/backend/utils/sort/tuplesortvariants_spec.c
create mode 100644 src/include/access/nbtree_spec.h
create mode 100644 src/include/access/nbtree_specfuncs.h
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index e57625b75c..10ed67bffe 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2680,6 +2680,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTStack stack;
Buffer lbuf;
bool exists;
+ nbts_prep_ctx(NULL);
key = _bt_mkscankey(state->rel, itup);
Assert(key->heapkeyspace && key->scantid != NULL);
@@ -2780,6 +2781,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2843,6 +2845,7 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2867,6 +2870,7 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2906,6 +2910,7 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -3141,6 +3146,7 @@ static inline BTScanInsert
bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
+ nbts_prep_ctx(NULL);
skey = _bt_mkscankey(rel, itup);
skey->pivotsearch = true;
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 0f10141a2f..e9d0cf6ac1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1084,6 +1084,34 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+Notes about nbtree specialization
+---------------------------------
+
+Attribute iteration is a significant overhead for multi-column indexes
+with variable length attributes, due to our inability to cache the offset
+of each attribute into an on-disk tuple. To combat this, we'd have to either
+fully deserialize the tuple, or maintain our offset into the tuple as we
+iterate over the tuple's fields.
+
+Keeping track of this offset has a non-negligible overhead too, so we'd
+prefer to not have to keep track of these offsets when we can use the cache.
+By specializing performance-sensitive search functions for these specific
+index tuple shapes and calling those selectively, we can keep the performance
+of cacheable attribute offsets where that is applicable, while improving
+performance where we currently would see O(n_atts^2) time iterating on
+variable-length attributes. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - multi-column indexes that could benefit from the attcacheoff optimization
+ NB: This is also the default path, and is comparatively slow for uncachable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index d4db0b28f2..4589ade267 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,260 +22,14 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
- bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple and add it to our temp page (newpage).
- * Else add pending interval's base tuple to the temp page as-is.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place, when this page's newly allocated right sibling page
- * gets its first deduplication pass).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis of the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.c"
+#include "access/nbtree_spec.h"
/*
* Perform bottom-up index deletion pass.
@@ -316,6 +70,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
TM_IndexDeleteOp delstate;
bool neverdedup;
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ nbts_prep_ctx(rel);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -752,55 +507,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
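
The definitions behind NBT_SPECIALIZE_FILE, NBTS_FUNCTION() and
nbts_prep_ctx() live in access/nbtree_spec.h, which is not part of this
excerpt. As a rough, hypothetical illustration only (the real macros and
shape names may differ), the underlying technique is the usual
token-pasting-plus-repeated-expansion trick:

/*
 * Simplified, assumed illustration of shape-based name mangling; the shape
 * names "cached" and "uncached" are placeholders, not the patch's names.
 */
#include <stdio.h>

#define NBTS_MAKE_NAME_(name, shape)  name##_##shape
#define NBTS_MAKE_NAME(name, shape)   NBTS_MAKE_NAME_(name, shape)
#define NBTS_FUNCTION(name)           NBTS_MAKE_NAME(name, NBTS_TYPE)

/* The body of a *_spec.c file would be expanded once per key shape;
 * here the two expansions are simply written out by hand. */
#define NBTS_TYPE cached
static void NBTS_FUNCTION(describe)(void)
{
    puts("shape: attribute offsets cacheable (attcacheoff path)");
}
#undef NBTS_TYPE

#define NBTS_TYPE uncached
static void NBTS_FUNCTION(describe)(void)
{
    puts("shape: variable-length keys, offsets tracked by an iterator");
}
#undef NBTS_TYPE

int
main(void)
{
    /* a caller picks the variant matching the index's key shape */
    describe_cached();
    describe_uncached();
    return 0;
}

In the patch itself the *_spec.c body is pulled in via the
NBT_SPECIALIZE_FILE include rather than written out per shape, so each key
shape gets its own compiled copy of the hot functions.
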
diff --git a/src/backend/access/nbtree/nbtdedup_spec.c b/src/backend/access/nbtree/nbtdedup_spec.c
new file mode 100644
index 0000000000..4b280de980
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.c
@@ -0,0 +1,317 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup_spec.c
+ * Index shape-specialized functions for nbtdedup.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "Notes about nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtdedup_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_do_singleval NBTS_FUNCTION(_bt_do_singleval)
+
+static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+_bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple and add it to our temp page (newpage).
+ * Else add pending interval's base tuple to the temp page as-is.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place, when this page's newly allocated right sibling page
+ * gets its first deduplication pass).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis of the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+_bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 39e7e9b731..3607bd418e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,28 +30,16 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, Relation heaprel,
- BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
static void _bt_stepright(Relation rel, Relation heaprel,
BTInsertState insertstate, BTStack stack);
-static void _bt_insertonpg(Relation rel, Relation heaprel, BTScanInsert itup_key,
- Buffer buf,
- Buffer cbuf,
- BTStack stack,
- IndexTuple itup,
- Size itemsz,
- OffsetNumber newitemoff,
- int postingoff,
+static void _bt_insertonpg(Relation rel, Relation heaprel,
+ BTScanInsert itup_key, Buffer buf, Buffer cbuf,
+ BTStack stack, IndexTuple itup, Size itemsz,
+ OffsetNumber newitemoff, int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key,
Buffer buf, Buffer cbuf, OffsetNumber newitemoff,
@@ -75,313 +63,8 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, heapRel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, heapRel, itup_key, insertstate.buf, InvalidBuffer,
- stack, itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
- AttrNumber cmpcol = 1;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
- &cmpcol) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
-
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, heaprel, insertstate->itup_key, &insertstate->buf,
- BT_WRITE, NULL);
-}
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.c"
+#include "access/nbtree_spec.h"
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -425,6 +108,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
bool inposting = false;
bool prevalldead = true;
int curposti = 0;
+ nbts_prep_ctx(rel);
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -776,253 +460,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
- break;
-
- _bt_stepright(rel, heapRel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, heapRel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- {
- AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
- }
-
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1506,6 +943,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
bool newitemonleft,
isleaf,
isrightmost;
+ nbts_prep_ctx(rel);
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -2706,6 +2144,7 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(buffer);
BTPageOpaque opaque = BTPageGetOpaque(page);
+ nbts_prep_ctx(rel);
Assert(P_ISLEAF(opaque));
Assert(simpleonly || itup_key->heapkeyspace);
diff --git a/src/backend/access/nbtree/nbtinsert_spec.c b/src/backend/access/nbtree/nbtinsert_spec.c
new file mode 100644
index 0000000000..6915f22839
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.c
@@ -0,0 +1,584 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtinsert_spec.c
+ * Index shape-specialized functions for nbtinsert.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "Notes about nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtinsert_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_search_insert NBTS_FUNCTION(_bt_search_insert)
+#define _bt_findinsertloc NBTS_FUNCTION(_bt_findinsertloc)
+
+static BTStack _bt_search_insert(Relation rel, Relation heaprel,
+ BTInsertState insertstate);
+static OffsetNumber _bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+_bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = _bt_mkscankey(rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = _bt_search_insert(rel, heapRel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, heapRel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+_bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return _bt_search(rel, heaprel, insertstate->itup_key, &insertstate->buf,
+ BT_WRITE, NULL);
+}
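
The fastpath above follows a simple pattern: conditionally lock the cached
rightmost block, re-check that it is still usable for this insert, and
otherwise forget the cache and fall back to a full descent. A minimal
standalone sketch of that pattern (invented names, with a pthread mutex
standing in for the buffer lock; an illustration, not patch code):

/* build with: cc -pthread fastpath_demo.c */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Page
{
    pthread_mutex_t lock;
    int         free_space;
    bool        rightmost;
} Page;

/* stand-ins for the tree and the per-backend target-block cache */
static Page root_leaf = {PTHREAD_MUTEX_INITIALIZER, 8192, true};
static Page *cached_target = NULL;

/* stand-in for the full _bt_search() descent */
static Page *
descend_from_root(void)
{
    pthread_mutex_lock(&root_leaf.lock);
    return &root_leaf;
}

static Page *
find_insert_page(int itemsz)
{
    if (cached_target != NULL &&
        pthread_mutex_trylock(&cached_target->lock) == 0)
    {
        /* re-check that the cached page is still usable for this insert */
        if (cached_target->rightmost && cached_target->free_space >= itemsz)
            return cached_target;   /* fast path: keep the lock */
        pthread_mutex_unlock(&cached_target->lock);
    }

    /* cache empty, contended, or page unsuitable: forget it and descend */
    cached_target = NULL;
    return descend_from_root();
}

int
main(void)
{
    Page       *p = find_insert_page(64);

    cached_target = p;          /* remember the target for the next insert */
    pthread_mutex_unlock(&p->lock);
    return 0;
}

The conditional lock is the important design choice mirrored here: a busy
rightmost page makes the per-backend cache short-lived anyway, so giving up
instead of waiting keeps contended workloads off the fastpath entirely.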
+
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+_bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
+ break;
+
+ _bt_stepright(rel, heapRel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, heapRel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
+
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
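
On the 0.99 probability mentioned above: the test
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100) succeeds for
roughly 1 in 100 of the possible 32-bit values, so the scan keeps moving
right with probability of about 0.99. A quick standalone check of that
arithmetic, with a simple xorshift generator standing in for
pg_prng_uint32() (a sketch, not patch code):

#include <stdint.h>
#include <stdio.h>

/* stand-in PRNG: full period over the nonzero 32-bit values */
static uint32_t
xorshift32(uint32_t *state)
{
    uint32_t    x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

int
main(void)
{
    const uint32_t threshold = UINT32_MAX / 100;
    uint32_t    state = 2023;
    long        stops = 0;
    long        trials = 10 * 1000 * 1000;

    for (long i = 0; i < trials; i++)
    {
        if (xorshift32(&state) <= threshold)
            stops++;            /* "got tired": stop moving right */
    }

    /* prints roughly 0.0100, i.e. we move right ~99% of the time */
    printf("stop fraction: %.4f\n", (double) stops / trials);
    return 0;
}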
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c2050656e4..504fa37d99 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1810,6 +1810,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
bool rightsib_empty;
Page page;
BTPageOpaque opaque;
+ nbts_prep_ctx(rel);
/*
* Save original leafbuf block number from caller. Only deleted blocks
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4553aaee53..ff1611bfb8 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,8 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.c"
+#include "access/nbtree_spec.h"
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -121,7 +123,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambuild = btbuild;
amroutine->ambuildempty = btbuildempty;
- amroutine->aminsert = btinsert;
+ amroutine->aminsert = btinsert_default;
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
@@ -153,6 +155,8 @@ btbuildempty(Relation index)
{
Page metapage;
+ nbt_opt_specialize(index);
+
/* Construct metapage. */
metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
@@ -178,33 +182,6 @@ btbuildempty(Relation index)
smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
@@ -346,6 +323,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -789,6 +768,8 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(rel);
+
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
new file mode 100644
index 0000000000..6b766581ab
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.c
+ * Index shape-specialized functions for nbtree.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtree_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
+void
+_bt_specialize(Relation rel)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ NBTS_MAKE_CTX(rel);
+ /*
+ * We can't directly address _bt_specialize here because it'd be macro-
+ * expanded, nor can we use NBTS_SPECIALIZE_NAME here because that would
+ * call back into _bt_specialize, resulting in infinite recursion.
+ */
+ switch (__nbts_ctx) {
+ case NBTS_CTX_CACHED:
+ _bt_specialize_cached(rel);
+ break;
+ case NBTS_CTX_DEFAULT:
+ break;
+ }
+#else
+ rel->rd_indam->aminsert = btinsert;
+#endif
+}
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+ bool result;
+ IndexTuple itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+}
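
For readers who haven't opened nbtree_spec.h yet: the pattern used here is
to compile the same *_spec.c template once per supported key shape, mangle
the function names per variant (NBTS_FUNCTION), and pick a variant at
runtime, in this case by swapping rd_indam->aminsert. A minimal single-file
sketch of that idea, where every name is invented for illustration (the
real machinery is the NBT_SPECIALIZE_FILE include plus the macros in
nbtree_spec.h):

#include <stdio.h>

/*
 * "Template": one comparison loop, stamped out once per key shape.  The
 * single-column variant lets the compiler drop the loop entirely.
 */
#define MAKE_CMP_FUNC(suffix, assume_single) \
static int \
cmp_##suffix(const int *a, const int *b, int natts) \
{ \
    int n = (assume_single) ? 1 : natts; \
    for (int i = 0; i < n; i++) \
    { \
        if (a[i] != b[i]) \
            return (a[i] < b[i]) ? -1 : 1; \
    } \
    return 0; \
}

MAKE_CMP_FUNC(single, 1)    /* specialized: exactly one key column */
MAKE_CMP_FUNC(generic, 0)   /* generic fallback */

typedef int (*cmp_fn) (const int *a, const int *b, int natts);

/* chosen once per index, like _bt_specialize() swapping aminsert */
static cmp_fn
choose_cmp(int nkeyatts)
{
    return (nkeyatts == 1) ? cmp_single : cmp_generic;
}

int
main(void)
{
    int         a[] = {1, 2, 3};
    int         b[] = {1, 2, 4};
    cmp_fn      cmp = choose_cmp(3);

    printf("%d\n", cmp(a, b, 3));   /* -1 */
    return 0;
}

The payoff is that each per-shape variant can rely at compile time on
properties of the key layout (such as there being exactly one key
attribute) that the generic path would otherwise have to test on every
tuple.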
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7423b76e1c..4c853f1e4b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,12 +25,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
- AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -47,6 +43,8 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.c"
+#include "access/nbtree_spec.h"
/*
* _bt_drop_lock_and_maybe_pin()
@@ -71,581 +69,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- *
- * heaprel must be provided by callers that pass access = BT_WRITE, since we
- * might need to allocate a new root page for caller -- see _bt_allocbuf.
- */
-BTStack
-_bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
- int access, Snapshot snapshot)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
- char tupdatabuf[BLCKSZ / 3];
- AttrNumber highkeycmpcol = 1;
-
- /* heaprel must be set whenever _bt_allocbuf is reachable */
- Assert(access == BT_READ || access == BT_WRITE);
- Assert(access == BT_READ || heaprel != NULL);
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, heaprel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
- stack_in, page_access, snapshot, &highkeycmpcol,
- (char *) tupdatabuf);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
- memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- highkeycmpcol = 1;
-
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot, &highkeycmpcol, (char *) tupdatabuf);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split is
- * completed. 'heaprel' and 'stack' are only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- */
-Buffer
-_bt_moveright(Relation rel,
- Relation heaprel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- Snapshot snapshot,
- AttrNumber *comparecol,
- char *tupdatabuf)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- Assert(!forupdate || heaprel != NULL);
- Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- page = BufferGetPage(buf);
- TestForOldSnapshot(snapshot, rel, page);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- {
- *comparecol = 1;
- break;
- }
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, heaprel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- /*
- * tupdatabuf is filled with the right separator of the parent node.
- * This allows us to do a binary equality check between the parent
- * node's right separator (which is < key) and this page's P_HIKEY.
- * If they are equal, we can reuse the result of the parent node's
- * right-key compare, which means we can potentially save a full key
- * compare (which includes indirect calls to attribute comparison
- * functions).
- *
- * Without this, we'd on average use 3 full key compares per page before
- * we achieve full dynamic prefix bounds, but with this optimization
- * that is only 2.
- *
- * 3 compares: 1 for the high key (rightmost), and on average 2 before
- * we move right in the binary search on the page; this average equals
- * SUM(1/2^x) for x from 0 to log(n items), which tends to 2.
- */
- if (!P_IGNORE(opaque) && *comparecol > 1)
- {
- IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
- IndexTuple buftuple = (IndexTuple) tupdatabuf;
- if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
- {
- char *dataptr = (char *) itup;
-
- if (memcmp(dataptr + sizeof(IndexTupleData),
- tupdatabuf + sizeof(IndexTupleData),
- IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
- break;
- } else {
- *comparecol = 1;
- }
- } else {
- *comparecol = 1;
- }
-
- if (P_IGNORE(opaque) ||
- _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
- {
- *comparecol = 1;
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- {
- *comparecol = cmpcol;
- break;
- }
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
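
The tupdatabuf check in the loop above boils down to a raw binary
comparison of two tuples' data areas: when the parent's right separator and
this page's high key are byte-for-byte identical, the earlier compare
result can be reused and the full per-attribute compare skipped. A
self-contained sketch of just that precheck, with a simplified layout
standing in for IndexTupleData (an illustration, not patch code):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* simplified stand-in for IndexTupleData: a size header plus key bytes */
typedef struct DemoTuple
{
    uint16_t    size;           /* total size, header included */
    char        data[];         /* key attribute bytes */
} DemoTuple;

/*
 * Return true when both tuples carry identical key bytes; a caller could
 * then reuse an earlier compare result instead of re-comparing every
 * attribute through its comparison function.
 */
static bool
tuples_binary_equal(const DemoTuple *a, const DemoTuple *b)
{
    if (a->size != b->size)
        return false;
    return memcmp(a->data, b->data,
                  a->size - offsetof(DemoTuple, data)) == 0;
}

int
main(void)
{
    const char  key[] = "foo";
    size_t      tupsize = offsetof(DemoTuple, data) + sizeof(key);
    DemoTuple  *a = malloc(tupsize);
    DemoTuple  *b = malloc(tupsize);

    a->size = b->size = (uint16_t) tupsize;
    memcpy(a->data, key, sizeof(key));
    memcpy(b->data, key, sizeof(key));

    printf("%d\n", tuples_binary_equal(a, b));  /* 1 */
    free(a);
    free(b);
    return 0;
}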
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf,
- AttrNumber *highkeycmpcol)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
- AttrNumber highcmpcol = *highkeycmpcol,
- lowcmpcol = 1;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
- }
- }
-
- *highkeycmpcol = highcmpcol;
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
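
The cmpcol bookkeeping in _bt_binsrch above is the dynamic prefix bound
idea in miniature: because the items are sorted, every item strictly
between the current bounds agrees with the search key on the first
Min(lowcmpcol, highcmpcol) - 1 columns, so each _bt_compare call may start
at that column. A standalone sketch of the same invariant over rows of
plain integers (names and layout invented for illustration):

#include <stdio.h>

#define NCOLS 3

typedef struct Row
{
    int         cols[NCOLS];
} Row;

/*
 * Compare key against row, starting at column *cmpcol (1-based).  On
 * return, *cmpcol is the first column that differed, or NCOLS + 1 when all
 * columns were equal -- mirroring _bt_compare()'s comparecol contract.
 */
static int
row_compare(const int *key, const Row *row, int *cmpcol)
{
    for (int i = *cmpcol; i <= NCOLS; i++)
    {
        int         a = key[i - 1];
        int         b = row->cols[i - 1];

        if (a != b)
        {
            *cmpcol = i;
            return (a < b) ? -1 : 1;
        }
    }
    *cmpcol = NCOLS + 1;
    return 0;
}

/* binary search for the first row >= key, with dynamic prefix bounds */
static int
binsrch(const int *key, const Row *rows, int nrows)
{
    int         low = 0,
                high = nrows;
    int         lowcol = 1,
                highcol = 1;

    while (high > low)
    {
        int         mid = low + (high - low) / 2;
        int         cmpcol = (lowcol < highcol) ? lowcol : highcol;
        int         r = row_compare(key, &rows[mid], &cmpcol);

        if (r > 0)
        {
            low = mid + 1;
            lowcol = cmpcol;    /* rows below 'low' match key up to here */
        }
        else
        {
            high = mid;
            highcol = cmpcol;   /* rows at/after 'high' match key up to here */
        }
    }
    return low;
}

int
main(void)
{
    Row         rows[] = {
        {{1, 1, 1}}, {{1, 1, 2}}, {{1, 2, 1}}, {{2, 1, 1}}, {{2, 2, 2}}
    };
    int         key[NCOLS] = {1, 2, 0};

    printf("%d\n", binsrch(key, rows, 5));  /* 2 */
    return 0;
}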
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
- AttrNumber lowcmpcol = 1;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
-
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
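
The cached bounds deserve a small illustration of their own: the first
search establishes [low, stricthigh) for the key, and a later search
against the same, unmodified page can restart from that window instead of
the whole item range. A simplified standalone sketch with a plain integer
array standing in for the page (an illustration only; the real code also
handles bounds invalidation and posting-list offsets):

#include <stdio.h>

typedef struct SearchBounds
{
    int         low;            /* first candidate slot */
    int         stricthigh;     /* slots at or after this are > key */
    int         valid;
} SearchBounds;

/*
 * Find the first slot >= key in arr.  When bounds from an earlier search
 * against the same (unmodified) array are valid, restart from them, the
 * way _bt_binsrch_insert() reuses the bounds cached in BTInsertState.
 */
static int
binsrch_cached(const int *arr, int nelems, int key, SearchBounds *bounds)
{
    int         low = bounds->valid ? bounds->low : 0;
    int         high = bounds->valid ? bounds->stricthigh : nelems;
    int         stricthigh = high;

    while (high > low)
    {
        int         mid = low + (high - low) / 2;

        if (arr[mid] < key)
            low = mid + 1;
        else
        {
            high = mid;
            if (arr[mid] != key)
                stricthigh = high;
        }
    }

    bounds->low = low;
    bounds->stricthigh = stricthigh;
    bounds->valid = 1;
    return low;
}

int
main(void)
{
    int         arr[] = {1, 3, 3, 3, 7, 9};
    SearchBounds bounds = {0, 0, 0};

    /* the first call searches the whole array ... */
    printf("%d\n", binsrch_cached(arr, 6, 3, &bounds));     /* 1 */
    /* ... the second only the cached [low, stricthigh) window */
    printf("%d\n", binsrch_cached(arr, 6, 3, &bounds));     /* 1 */
    return 0;
}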
-
/*----------
* _bt_binsrch_posting() -- posting list binary search.
*
@@ -713,228 +136,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum,
- AttrNumber *comparecol)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
-
- scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- {
- *comparecol = i;
- return result;
- }
-
- scankey++;
- }
-
- /*
- * All tuple attributes are equal to the scan key, only later attributes
- * could potentially not equal the scan key.
- */
- *comparecol = ntupatts + 1;
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
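
One detail of the per-attribute compare above that is easy to miss: NULLs
are treated as ordinary sortable values whose position depends on NULLS
FIRST/LAST, and the support function's result is sign-flipped for ASC
columns because the index value is passed as its left argument. A tiny
standalone sketch of those rules, with a nullable int standing in for a
Datum (an illustration, not patch code):

#include <stdbool.h>
#include <stdio.h>

typedef struct NullableInt
{
    bool        isnull;
    int         value;
} NullableInt;

/*
 * Compare a (possibly NULL) scankey argument against a (possibly NULL)
 * index value, following the same rules as the loop in _bt_compare(): the
 * return value says how the scankey compares to the index value.
 */
static int
compare_att(NullableInt key, NullableInt idx, bool desc, bool nulls_first)
{
    int         result;

    if (key.isnull)
    {
        if (idx.isnull)
            return 0;                       /* NULL "=" NULL */
        return nulls_first ? -1 : 1;
    }
    if (idx.isnull)
        return nulls_first ? 1 : -1;

    /* support function compares index value (left) vs. key (right) */
    result = (idx.value > key.value) - (idx.value < key.value);

    /* flip the sign for ASC columns, keep it for DESC ones */
    return desc ? result : -result;
}

int
main(void)
{
    NullableInt null_v = {true, 0};
    NullableInt five = {false, 5};
    NullableInt seven = {false, 7};

    printf("%d\n", compare_att(five, seven, false, false)); /* -1: key < value */
    printf("%d\n", compare_att(five, seven, true, false));  /* 1: DESC flips */
    printf("%d\n", compare_att(null_v, five, false, false));/* 1: NULLS LAST */
    return 0;
}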
-
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -976,6 +177,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
BTScanPosItem *currItem;
BlockNumber blkno;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(rel);
Assert(!BTScanPosIsValid(so->currPos));
@@ -1598,280 +800,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * is safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2080,12 +1008,11 @@ static bool
_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Relation rel;
+ Relation rel = scan->indexRelation;
Page page;
BTPageOpaque opaque;
bool status;
-
- rel = scan->indexRelation;
+ nbts_prep_ctx(rel);
if (ScanDirectionIsForward(dir))
{
@@ -2497,6 +1424,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
BTPageOpaque opaque;
OffsetNumber start;
BTScanPosItem *currItem;
+ nbts_prep_ctx(rel);
/*
* Scan down to the leftmost or rightmost leaf page. This is a simplified
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
new file mode 100644
index 0000000000..1e04caf090
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -0,0 +1,1096 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsearch_spec.c
+ * Index shape-specialized functions for nbtsearch.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsearch_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_binsrch NBTS_FUNCTION(_bt_binsrch)
+#define _bt_readpage NBTS_FUNCTION(_bt_readpage)
+
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
+static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ *
+ * heaprel must be provided by callers that pass access = BT_WRITE, since we
+ * might need to allocate a new root page for caller -- see _bt_allocbuf.
+ */
+BTStack
+_bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
+ int access, Snapshot snapshot)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
+
+ /* heaprel must be set whenever _bt_allocbuf is reachable */
+ Assert(access == BT_READ || access == BT_WRITE);
+ Assert(access == BT_READ || heaprel != NULL);
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, heaprel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
+ stack_in, page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
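+ /*
+ * Remember the pivot tuple we descend through, so that the next level's
+ * _bt_moveright can compare it against the child page's high key.
+ */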
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is the lowest non-leaf level, just above the leaves. So,
+ * if we're on level 1 and were asked to lock the leaf page in write
+ * mode, lock the next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ highkeycmpcol = 1;
+
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+ snapshot, &highkeycmpcol, (char *) tupdatabuf);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split is
+ * completed. 'heaprel' and 'stack' are only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
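+ *
+ * comparecol (in/out) and tupdatabuf implement the dynamic prefix check
+ * described below: tupdatabuf holds a copy of the separator tuple from the
+ * parent level, and *comparecol is the prefix bound established by the
+ * parent-level comparisons. On return, *comparecol holds the bound that
+ * applies to the page we stopped on, or 1 when no bound is known.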
+ */
+Buffer
+_bt_moveright(Relation rel,
+ Relation heaprel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ Assert(!forupdate || heaprel != NULL);
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
+ break;
+ }
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, heaprel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ /*
+ * tupdatabuf is filled with the right separator of the parent node.
+ * This allows us to do a binary equality check between the parent
+ * node's right separator (which is < key) and this page's P_HIKEY.
+ * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+ * Without this, we'd use on average 3 full key compares per page
+ * before we achieve full dynamic prefix bounds; with this
+ * optimization that drops to 2.
+ *
+ * 3 compares: 1 for the high key (rightmost), plus on average 2
+ * before we move right in the binary search on the page; that
+ * average is SUM((1/2)^x) for x from 0 to log2(n items), which
+ * tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ }
+ else
+ *comparecol = 1;
+ }
+ else
+ *comparecol = 1;
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
+ {
+ *comparecol = 1;
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ {
+ *comparecol = cmpcol;
+ break;
+ }
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
+
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
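+ *
+ * highkeycmpcol (in/out) is a dynamic prefix bound: attributes before it
+ * are known to compare equal to the scan key for this page's high key.
+ * On return it holds the bound from the binary search's final upper
+ * bound, which can be carried over to the high key of the child page
+ * whose downlink we return.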
+ */
+static OffsetNumber
+_bt_binsrch(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
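+ /*
+ * Dynamic prefix bound: the tuple just below 'low' is known to match
+ * the scan key on its first lowcmpcol - 1 attributes, and the upper
+ * bound at 'high' (initially the page's high key) on its first
+ * highcmpcol - 1 attributes. Any tuple in between must match on the
+ * smaller of the two prefixes, so _bt_compare may start at attribute
+ * cmpcol.
+ */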
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+ }
+ }
+
+ *highkeycmpcol = highcmpcol;
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
+
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+ AttrNumber lowcmpcol = 1;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
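+ /* cmpcol is a conservative shared-prefix bound for this compare; see _bt_binsrch */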
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
+
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+_bt_compare(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+
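+ /*
+ * Callers pass *comparecol > 1 when attributes 1 .. *comparecol - 1 are
+ * already known to compare equal to the scan key for this tuple
+ * (dynamic prefix bound), so start the comparison there.
+ */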
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ {
+ *comparecol = i;
+ return result;
+ }
+
+ scankey++;
+ }
+
+ /*
+ * All of the tuple's attributes compared equal to the scan key; only
+ * attributes beyond ntupatts could still differ, so record ntupatts + 1
+ * as the new prefix bound.
+ */
+ *comparecol = ntupatts + 1;
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
+
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* allow next page to be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it is
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit the page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c2665fce41..8742716383 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,8 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
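+/*
+ * access/nbtree_spec.h is expected to pull in NBT_SPECIALIZE_FILE once per
+ * supported key shape, generating the specialized variants of the functions
+ * defined in nbtsort_spec.c (currently just _bt_load).
+ */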
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.c"
+#include "access/nbtree_spec.h"
/*
* btbuild() -- build a new btree index.
@@ -544,6 +544,7 @@ static void
_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
{
BTWriteState wstate;
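+ /*
+ * nbts_prep_ctx() presumably sets up the dispatch to the variant of
+ * _bt_load() specialized for this index's key shape.
+ */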
+ nbts_prep_ctx(btspool->index);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
@@ -846,6 +847,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
Size pgspc;
Size itupsz;
bool isleaf;
+ nbts_prep_ctx(wstate->index);
/*
* This is a handy place to check for cancel interrupts during the btree
@@ -1178,264 +1180,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- Assert(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
new file mode 100644
index 0000000000..368d6f244c
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsort_spec.c
+ * Index shape-specialized functions for nbtsort.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsort_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_load NBTS_FUNCTION(_bt_load)
+
+static void _bt_load(BTWriteState *wstate,
+ BTSpool *btspool, BTSpool *btspool2);
+
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ Assert(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 43b67893d9..db2da1e303 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -639,6 +639,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
ItemId itemid;
IndexTuple tup;
int keepnatts;
+ nbts_prep_ctx(state->rel);
Assert(state->is_leaf && !state->is_rightmost);
@@ -945,6 +946,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
*rightinterval;
int perfectpenalty;
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+ nbts_prep_ctx(state->rel);
/* Assume that alternative strategy won't be used for now */
*strategy = SPLIT_DEFAULT;
@@ -1137,6 +1139,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
{
IndexTuple lastleft;
IndexTuple firstright;
+ nbts_prep_ctx(state->rel);
if (!state->is_leaf)
{
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7da499c4dd..37d644e9f3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,10 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.c"
+#include "access/nbtree_spec.h"
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1220,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2173,286 +1703,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
new file mode 100644
index 0000000000..0288da22d6
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -0,0 +1,775 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtutils_spec.c
+ * Index shape-specialized functions for nbtutils.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtutils_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_check_rowcompare NBTS_FUNCTION(_bt_check_rowcompare)
+#define _bt_keep_natts NBTS_FUNCTION(_bt_keep_natts)
+
+static bool _bt_check_rowcompare(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
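+/*
+ * Illustrative sketch only (the expansion below is an assumption, not part
+ * of this patch's generated code): NBTS_FUNCTION is expected to append the
+ * current specialization's suffix to the bare name, so that each inclusion
+ * of this file emits its own static copies, along the lines of
+ * _bt_check_rowcompare_cached / _bt_check_rowcompare_default, matching the
+ * NBTS_TYPE_* values declared in nbtree_spec.h.
+ */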
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+ continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+ TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split, be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb6cfcfd00..12f909e1cf 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -57,8 +57,6 @@ static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
SortTuple *stup);
static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
LogicalTape *tape, unsigned int tuplen);
-static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
@@ -130,6 +128,9 @@ typedef struct
int datumTypeLen;
} TuplesortDatumArg;
+#define NBT_SPECIALIZE_FILE "../../backend/utils/sort/tuplesortvariants_spec.c"
+#include "access/nbtree_spec.h"
+
Tuplesortstate *
tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
@@ -217,6 +218,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
MemoryContext oldcontext;
TuplesortClusterArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
@@ -328,6 +330,7 @@ tuplesort_begin_index_btree(Relation heapRel,
TuplesortIndexBTreeArg *arg;
MemoryContext oldcontext;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -461,6 +464,7 @@ tuplesort_begin_index_gist(Relation heapRel,
MemoryContext oldcontext;
TuplesortIndexBTreeArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -1259,142 +1263,6 @@ removeabbrev_index(Tuplesortstate *state, SortTuple *stups, int count)
}
}
-static int
-comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- /*
- * This is similar to comparetup_heap(), but expects index tuples. There
- * is also special handling for enforcing uniqueness, and special
- * treatment for equal keys at the end.
- */
- TuplesortPublic *base = TuplesortstateGetPublic(state);
- TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
- SortSupport sortKey = base->sortKeys;
- IndexTuple tuple1;
- IndexTuple tuple2;
- int keysz;
- TupleDesc tupDes;
- bool equal_hasnull = false;
- int nkey;
- int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
-
- /* Compare the leading sort key */
- compare = ApplySortComparator(a->datum1, a->isnull1,
- b->datum1, b->isnull1,
- sortKey);
- if (compare != 0)
- return compare;
-
- /* Compare additional sort keys */
- tuple1 = (IndexTuple) a->tuple;
- tuple2 = (IndexTuple) b->tuple;
- keysz = base->nKeys;
- tupDes = RelationGetDescr(arg->index.indexRel);
-
- if (sortKey->abbrev_converter)
- {
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
- }
-
- /* they are equal, so we only need to examine one null flag */
- if (a->isnull1)
- equal_hasnull = true;
-
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
- {
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
-
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare; /* done when we find unequal attributes */
-
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
- equal_hasnull = true;
- }
-
- /*
- * If btree has asked us to enforce uniqueness, complain if two equal
- * tuples are detected (unless there was at least one NULL field and NULLS
- * NOT DISTINCT was not set).
- *
- * It is sufficient to make the test here, because if two tuples are equal
- * they *must* get compared at some stage of the sort --- otherwise the
- * sort algorithm wouldn't have checked whether one must appear before the
- * other.
- */
- if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- /*
- * Some rather brain-dead implementations of qsort (such as the one in
- * QNX 4) will sometimes call the comparison routine to compare a
- * value to itself, but we always use our own implementation, which
- * does not.
- */
- Assert(tuple1 != tuple2);
-
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is required for
- * btree indexes, since heap TID is treated as an implicit last key
- * attribute in order to ensure that all keys in the index are physically
- * unique.
- */
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
- /* ItemPointer values should never be equal */
- Assert(false);
-
- return 0;
-}
-
static int
comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state)
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
new file mode 100644
index 0000000000..0791f41136
--- /dev/null
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -0,0 +1,158 @@
+/*-------------------------------------------------------------------------
+ *
+ * tuplesortvariants_spec.c
+ * Index shape-specialized functions for tuplesortvariants.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/sort/tuplesortvariants_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define comparetup_index_btree NBTS_FUNCTION(comparetup_index_btree)
+
+static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+
+static int
+comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ /*
+ * This is similar to comparetup_heap(), but expects index tuples. There
+ * is also special handling for enforcing uniqueness, and special
+ * treatment for equal keys at the end.
+ */
+ TuplesortPublic *base = TuplesortstateGetPublic(state);
+ TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
+ SortSupport sortKey = base->sortKeys;
+ IndexTuple tuple1;
+ IndexTuple tuple2;
+ int keysz;
+ TupleDesc tupDes;
+ bool equal_hasnull = false;
+ int nkey;
+ int32 compare;
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+
+ /* Compare the leading sort key */
+ compare = ApplySortComparator(a->datum1, a->isnull1,
+ b->datum1, b->isnull1,
+ sortKey);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ tuple1 = (IndexTuple) a->tuple;
+ tuple2 = (IndexTuple) b->tuple;
+ keysz = base->nKeys;
+ tupDes = RelationGetDescr(arg->index.indexRel);
+
+ if (sortKey->abbrev_converter)
+ {
+ datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
+
+ compare = ApplySortAbbrevFullComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare;
+ }
+
+ /* they are equal, so we only need to examine one null flag */
+ if (a->isnull1)
+ equal_hasnull = true;
+
+ sortKey++;
+ for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ {
+ datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+
+ compare = ApplySortComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare; /* done when we find unequal attributes */
+
+ /* they are equal, so we only need to examine one null flag */
+ if (isnull1)
+ equal_hasnull = true;
+ }
+
+ /*
+ * If btree has asked us to enforce uniqueness, complain if two equal
+ * tuples are detected (unless there was at least one NULL field and NULLS
+ * NOT DISTINCT was not set).
+ *
+ * It is sufficient to make the test here, because if two tuples are equal
+ * they *must* get compared at some stage of the sort --- otherwise the
+ * sort algorithm wouldn't have checked whether one must appear before the
+ * other.
+ */
+ if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ /*
+ * Some rather brain-dead implementations of qsort (such as the one in
+ * QNX 4) will sometimes call the comparison routine to compare a
+ * value to itself, but we always use our own implementation, which
+ * does not.
+ */
+ Assert(tuple1 != tuple2);
+
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is required for
+ * btree indexes, since heap TID is treated as an implicit last key
+ * attribute in order to ensure that all keys in the index are physically
+ * unique.
+ */
+ {
+ BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
+ BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
+
+ if (blk1 != blk2)
+ return (blk1 < blk2) ? -1 : 1;
+ }
+ {
+ OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
+ OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
+
+ if (pos1 != pos2)
+ return (pos1 < pos2) ? -1 : 1;
+ }
+
+ /* ItemPointer values should never be equal */
+ Assert(false);
+
+ return 0;
+}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 11f4184107..d1bbc4d2a8 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1121,15 +1121,27 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+typedef enum NBTS_CTX {
+ NBTS_CTX_CACHED,
+ NBTS_CTX_DEFAULT, /* fallback */
+} NBTS_CTX;
+
+static inline NBTS_CTX _nbt_spec_context(Relation irel)
+{
+ if (!PointerIsValid(irel))
+ return NBTS_CTX_DEFAULT;
+
+ return NBTS_CTX_CACHED;
+}
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
+#include "nbtree_spec.h"
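+
+/*
+ * Illustrative note (a sketch, not additional code): translation units with
+ * specializable nbtree code follow this same two-line pattern, pointing
+ * NBT_SPECIALIZE_FILE at their own *_spec.c file; e.g. tuplesortvariants.c
+ * elsewhere in this patch does:
+ *
+ *     #define NBT_SPECIALIZE_FILE "../../backend/utils/sort/tuplesortvariants_spec.c"
+ *     #include "access/nbtree_spec.h"
+ */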
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1160,8 +1172,6 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1177,9 +1187,6 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
extern void _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf,
BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack,
@@ -1230,16 +1237,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
- Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
- Buffer buf, bool forupdate, BTStack stack,
- int access, Snapshot snapshot,
- AttrNumber *comparecol, char *tupdatabuf);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
- OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -1248,7 +1245,6 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1256,8 +1252,6 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1270,10 +1264,6 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
new file mode 100644
index 0000000000..fa38b09c6e
--- /dev/null
+++ b/src/include/access/nbtree_spec.h
@@ -0,0 +1,183 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.h
+ * header file for postgres btree access method implementation.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_spec.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ * - nbts_context(irel)
+ * Constructs a context that is used to call specialized functions.
+ * Note that this is unneeded in paths that are inaccessible to unspecialized
+ * code paths (i.e. code included through nbtree_spec.h), because that
+ * always calls the optimized functions directly.
+ */
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+#define NBTS_CTX_NAME __nbts_ctx
+
+/* contextual specializations */
+#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
+#define NBTS_SPECIALIZE_NAME(name) ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
+)
+
+/* how do we make names? */
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
+
+#define nbt_opt_specialize(rel) \
+do { \
+ Assert(PointerIsValid(rel)); \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
+} while (false)
+
+/*
+ * Protections against multiple inclusions - the definition of this macro is
+ * different for files included with the templating mechanism vs the users
+ * of this template, so redefine these macros at top and bottom.
+ */
+#ifdef NBTS_FUNCTION
+#undef NBTS_FUNCTION
+#endif
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+/* While specializing, the context is the local context */
+#ifdef nbts_prep_ctx
+#undef nbts_prep_ctx
+#endif
+#define nbts_prep_ctx(rel)
+
+/*
+ * Specialization 1: CACHED
+ *
+ * Multiple key columns, optimized access for attcacheoff -cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) do {} while (false)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_SPECIALIZING_CACHED
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * Specialization 2: DEFAULT
+ *
+ * "Default", externally accessible, not so optimized functions
+ */
+
+/* Only the default context may need to specialize in some cases, so here's that */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * All next uses of nbts_prep_ctx are in non-templated code, so here we make
+ * sure we actually create the context.
+ */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
+
+/*
+ * From here on, all NBTS_FUNCTION uses refer to specialized function names
+ * that are being called. Change the result of that macro from a direct call
+ * to a conditional call to the right specialization, depending on the
+ * correct context.
+ */
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) NBTS_SPECIALIZE_NAME(name)
+
+#undef NBT_SPECIALIZE_FILE
diff --git a/src/include/access/nbtree_specfuncs.h b/src/include/access/nbtree_specfuncs.h
new file mode 100644
index 0000000000..b87f5bf802
--- /dev/null
+++ b/src/include/access/nbtree_specfuncs.h
@@ -0,0 +1,65 @@
+/*
+ * prototypes for functions that are included in nbtree.h
+ */
+
+#define _bt_specialize NBTS_FUNCTION(_bt_specialize)
+#define btinsert NBTS_FUNCTION(btinsert)
+#define _bt_dedup_pass NBTS_FUNCTION(_bt_dedup_pass)
+#define _bt_doinsert NBTS_FUNCTION(_bt_doinsert)
+#define _bt_search NBTS_FUNCTION(_bt_search)
+#define _bt_moveright NBTS_FUNCTION(_bt_moveright)
+#define _bt_binsrch_insert NBTS_FUNCTION(_bt_binsrch_insert)
+#define _bt_compare NBTS_FUNCTION(_bt_compare)
+#define _bt_mkscankey NBTS_FUNCTION(_bt_mkscankey)
+#define _bt_checkkeys NBTS_FUNCTION(_bt_checkkeys)
+#define _bt_truncate NBTS_FUNCTION(_bt_truncate)
+#define _bt_keep_natts_fast NBTS_FUNCTION(_bt_keep_natts_fast)
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void _bt_specialize(Relation rel);
+
+extern bool btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem,
+ Size newitemsz, bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool _bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
+ Buffer *bufP, int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+ Buffer buf, bool forupdate, BTStack stack,
+ int access, Snapshot snapshot,
+ AttrNumber *comparecol, char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
+extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index 4e09c4686b..e504a2f114 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -116,6 +116,8 @@ do
test "$f" = src/pl/tcl/pltclerrcodes.h && continue
# Also not meant to be included standalone.
+ test "$f" = src/include/access/nbtree_spec.h && continue
+ test "$f" = src/include/access/nbtree_specfuncs.h && continue
test "$f" = src/include/common/unicode_nonspacing_table.h && continue
test "$f" = src/include/common/unicode_east_asian_fw_table.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 8dee1b5670..101888c806 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -111,6 +111,8 @@ do
test "$f" = src/pl/tcl/pltclerrcodes.h && continue
# Also not meant to be included standalone.
+ test "$f" = src/include/access/nbtree_spec.h && continue
+ test "$f" = src/include/access/nbtree_specfuncs.h && continue
test "$f" = src/include/common/unicode_nonspacing_table.h && continue
test "$f" = src/include/common/unicode_east_asian_fw_table.h && continue
--
2.39.0
On Fri, Jun 23, 2023 at 2:21 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
== Dynamic prefix truncation (0001)
The code now tracks how many prefix attributes of the scan key are
already considered equal based on earlier binsrch results, and ignores
those prefix columns in further binsrch operations (sorted list; if
both the high and low value of your range have the same prefix, the
middle value will have that prefix, too). This reduces the number of
calls into opclass-supplied (dynamic) compare functions, and thus
increases performance for multi-key-attribute indexes where shared
prefixes are common (e.g. index on (customer, order_id)).
I think the idea looks good to me.
I was looking into the 0001 patches, and I have one confusion in the
below hunk in the _bt_moveright function, basically, if the parent
page's right key exactly matches the HIGH key of the child page,
then I suppose while doing the "_bt_compare" with the HIGH_KEY we can
use the optimization, right? I.e. the column number from where we need
to start the comparison should be what is passed by the caller. But
in the below hunk, you are always passing that as 'cmpcol', which is 1.
I think this should be '*comparecol', because '*comparecol' will either
hold the value passed by the parent if the high key data exactly match
the parent's right tuple, or it will hold 1 in case it doesn't
match. Am I missing something?
@@ -247,13 +256,16 @@ _bt_moveright(Relation rel,
{
....
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
+ {
+ *comparecol = 1;
}
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, 23 Jun 2023 at 11:26, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jun 23, 2023 at 2:21 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
== Dynamic prefix truncation (0001)
The code now tracks how many prefix attributes of the scan key are
already considered equal based on earlier binsrch results, and ignores
those prefix columns in further binsrch operations (sorted list; if
both the high and low value of your range have the same prefix, the
middle value will have that prefix, too). This reduces the number of
calls into opclass-supplied (dynamic) compare functions, and thus
increases performance for multi-key-attribute indexes where shared
prefixes are common (e.g. index on (customer, order_id)).
I think the idea looks good to me.
I was looking into the 0001 patches,
Thanks for reviewing.
and I have one confusion in the
below hunk in the _bt_moveright function, basically, if the parent
page's right key exactly matches the HIGH key of the child page,
then I suppose while doing the "_bt_compare" with the HIGH_KEY we can
use the optimization, right? I.e. the column number from where we need
to start the comparison should be what is passed by the caller. But
in the below hunk, you are always passing that as 'cmpcol', which is 1.
I think this should be '*comparecol', because '*comparecol' will either
hold the value passed by the parent if the high key data exactly match
the parent's right tuple, or it will hold 1 in case it doesn't
match. Am I missing something?
We can't carry _bt_compare prefix results across pages, because the
key range of a page may shift while we're not holding a lock on that
page. That's also why the code resets the prefix to 1 every time it
accesses a new page ^1: it cannot guarantee correct results otherwise.
See also [0] and [1] for why that is important.
^1: When following downlinks, the code above your quoted code tries to
reuse the _bt_compare result of the parent page in the common case of
a child page's high key that is bytewise equal to the right separator
tuple of the parent page's downlink to this page. However, if it
detects that this child high key has changed (i.e. not 100% bytewise
equal), we can't reuse that result, and we'll have to re-establish all
prefix info on that page from scratch.
In any case, this only establishes the prefix for the right half of
the page's keyspace; the prefix of the left half of the data still
needs to be established separately.
I hope this explains the reasons why we can't reuse comparecol as the
_bt_compare argument.
Kind regards,
Matthias van de Meent
Neon, Inc.
[0]: /messages/by-id/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
On Fri, Jun 23, 2023 at 8:16 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Fri, 23 Jun 2023 at 11:26, Dilip Kumar <dilipbalaut@gmail.com> wrote:
and I have one confusion in the
below hunk in the _bt_moveright function, basically, if the parent
page's right key exactly matches the HIGH key of the child page,
then I suppose while doing the "_bt_compare" with the HIGH_KEY we can
use the optimization, right? I.e. the column number from where we need
to start the comparison should be what is passed by the caller. But
in the below hunk, you are always passing that as 'cmpcol', which is 1.
I think this should be '*comparecol', because '*comparecol' will either
hold the value passed by the parent if the high key data exactly match
the parent's right tuple, or it will hold 1 in case it doesn't
match. Am I missing something?
We can't carry _bt_compare prefix results across pages, because the
key range of a page may shift while we're not holding a lock on that
page. That's also why the code resets the prefix to 1 every time it
accesses a new page ^1: it cannot guarantee correct results otherwise.
See also [0] and [1] for why that is important.
Yeah that makes sense
^1: When following downlinks, the code above your quoted code tries to
reuse the _bt_compare result of the parent page in the common case of
a child page's high key that is bytewise equal to the right separator
tuple of the parent page's downlink to this page. However, if it
detects that this child high key has changed (i.e. not 100% bytewise
equal), we can't reuse that result, and we'll have to re-establish all
prefix info on that page from scratch.
In any case, this only establishes the prefix for the right half of
the page's keyspace; the prefix of the left half of the data still
needs to be established separately.
I hope this explains the reasons why we can't reuse comparecol as the
_bt_compare argument.
Yeah got it, thanks for explaining this. Now I see you have explained
this in the comments above the memcmp() statement as well.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 27, 2023 at 9:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jun 23, 2023 at 8:16 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Fri, 23 Jun 2023 at 11:26, Dilip Kumar <dilipbalaut@gmail.com> wrote:
and I have one confusion in the
below hunk in the _bt_moveright function, basically, if the parent
page's right key exactly matches the HIGH key of the child page,
then I suppose while doing the "_bt_compare" with the HIGH_KEY we can
use the optimization, right? I.e. the column number from where we need
to start the comparison should be what is passed by the caller. But
in the below hunk, you are always passing that as 'cmpcol', which is 1.
I think this should be '*comparecol', because '*comparecol' will either
hold the value passed by the parent if the high key data exactly match
the parent's right tuple, or it will hold 1 in case it doesn't
match. Am I missing something?
We can't carry _bt_compare prefix results across pages, because the
key range of a page may shift while we're not holding a lock on that
page. That's also why the code resets the prefix to 1 every time it
accesses a new page ^1: it cannot guarantee correct results otherwise.
See also [0] and [1] for why that is important.
Yeah that makes sense
^1: When following downlinks, the code above your quoted code tries to
reuse the _bt_compare result of the parent page in the common case of
a child page's high key that is bytewise equal to the right separator
tuple of the parent page's downlink to this page. However, if it
detects that this child high key has changed (i.e. not 100% bytewise
equal), we can't reuse that result, and we'll have to re-establish all
prefix info on that page from scratch.
In any case, this only establishes the prefix for the right half of
the page's keyspace; the prefix of the left half of the data still
needs to be established separately.
I hope this explains the reasons why we can't reuse comparecol as the
_bt_compare argument.
Yeah got it, thanks for explaining this. Now I see you have explained
this in the comments above the memcmp() statement as well.
At a high level, 0001 looks fine to me; just some suggestions:
1.
+Notes about dynamic prefix truncation
+-------------------------------------
I feel that instead of calling it "dynamic prefix truncation" we could
call it "dynamic prefix skipping"; I mean, we are not really truncating
anything, right? We are just skipping those attributes in the
comparison.
2.
I think we should add some comments in the _bt_binsrch() function,
where the main logic around maintaining highcmpcol and lowcmpcol
lives.
The notes section explains this very clearly, but adding some
comments here would be good, along with a reference to that section
of the README.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, 27 Jun 2023 at 06:57, Dilip Kumar <dilipbalaut@gmail.com> wrote:
At a high level, 0001 looks fine to me; just some suggestions:
Thanks for the review.
1.
+Notes about dynamic prefix truncation
+-------------------------------------
I feel that instead of calling it "dynamic prefix truncation" we could
call it "dynamic prefix skipping"; I mean, we are not really truncating
anything, right? We are just skipping those attributes in the
comparison.
The reason I am using "prefix truncation" is that that is a fairly
well-known term in literature (together with "prefix compression"),
and it was introduced on this list with that name by Peter in 2018 [0],
considering that normal "static" prefix truncation/compression is also
somewhere on my to-do list.
2.
I think we should add some comments in the _bt_binsrch() function,
where the main logic around maintaining highcmpcol and lowcmpcol
lives.
The notes section explains this very clearly, but adding some
comments here would be good, along with a reference to that section
of the README.
Updated in the attached version 12 of the patchset (which is also
rebased on HEAD @ 9c13b681). No changes apart from rebase fixes and
these added comments.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
[0]: /messages/by-id/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
Attachments:
v12-0003-Use-specialized-attribute-iterators-in-the-speci.patchapplication/octet-stream; name=v12-0003-Use-specialized-attribute-iterators-in-the-speci.patchDownload
From 73a38766c8bc32fc49d3acea32b66b47ff9ab030 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:57:21 +0100
Subject: [PATCH v12 3/6] Use specialized attribute iterators in the
specialized source files
This is committed separately to make clear what substantial changes were
made to the pre-existing code.
Even though not all nbt*_spec functions have been updated, these functions
can now directly call (and inline, and optimize for) the specialized functions
they call, instead of having to determine the right specialization based on
the (potentially locally unavailable) index relation. That still makes
specializing/duplicating those functions worthwhile.
---
src/backend/access/nbtree/nbtsearch_spec.c | 18 +++---
src/backend/access/nbtree/nbtsort_spec.c | 24 +++----
src/backend/access/nbtree/nbtutils_spec.c | 62 ++++++++++++-------
.../utils/sort/tuplesortvariants_spec.c | 53 +++++++++-------
4 files changed, 92 insertions(+), 65 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
index 5f1ead2400..c03aed3fd7 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.c
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -668,6 +668,7 @@ _bt_compare(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -700,23 +701,26 @@ _bt_compare(Relation rel,
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, *comparecol, itupdesc);
+
+ nbts_foreachattr(*comparecol, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -745,7 +749,7 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
{
- *comparecol = i;
+ *comparecol = nbts_attiter_attnum;
return result;
}
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
index 368d6f244c..6f33cc4cc2 100644
--- a/src/backend/access/nbtree/nbtsort_spec.c
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -34,8 +34,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -57,7 +56,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -90,21 +89,24 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else if (itup != NULL)
{
int32 compare = 0;
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
index 0288da22d6..07ca18f404 100644
--- a/src/backend/access/nbtree/nbtutils_spec.c
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -64,7 +64,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
- int i;
+ nbts_attiterdeclare(itup);
itupdesc = RelationGetDescr(rel);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -95,7 +95,10 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -106,27 +109,30 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -675,6 +681,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
@@ -686,20 +694,22 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -707,6 +717,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -747,24 +758,27 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
index 705da09329..cf262eee2d 100644
--- a/src/backend/utils/sort/tuplesortvariants_spec.c
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -66,47 +66,54 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
bool equal_hasnull = false;
int nkey;
int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
+ nbts_attiterdeclare(tuple1);
+ nbts_attiterdeclare(tuple2);
tuple1 = (IndexTuple) a->tuple;
tuple2 = (IndexTuple) b->tuple;
keysz = base->nKeys;
tupDes = RelationGetDescr(arg->index.indexRel);
- if (sortKey->abbrev_converter)
+ if (!sortKey->abbrev_converter)
{
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
+ nkey = 2;
+ sortKey++;
+ }
+ else
+ {
+ nkey = 1;
}
/* they are equal, so we only need to examine one null flag */
if (a->isnull1)
equal_hasnull = true;
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ nbts_attiterinit(tuple1, nkey, tupDes);
+ nbts_attiterinit(tuple2, nkey, tupDes);
+
+ nbts_foreachattr(nkey, keysz)
{
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+ Datum datum1,
+ datum2;
+ datum1 = nbts_attiter_nextattdatum(tuple1, tupDes);
+ datum2 = nbts_attiter_nextattdatum(tuple2, tupDes);
+
+ if (nbts_attiter_attnum == 1)
+ compare = ApplySortAbbrevFullComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ else
+ compare = ApplySortComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
if (compare != 0)
- return compare; /* done when we find unequal attributes */
+ return compare;
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
+ if (nbts_attiter_curattisnull(tuple1))
equal_hasnull = true;
+
+ sortKey++;
}
/*
--
2.40.1
v12-0005-Add-an-attcacheoff-populating-function.patchapplication/octet-stream; name=v12-0005-Add-an-attcacheoff-populating-function.patchDownload
From aff7566e784ca0f810b65effcd7a121781b11e37 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 12 Jan 2023 21:34:36 +0100
Subject: [PATCH v12 5/6] Add an attcacheoff-populating function
It populates attcacheoff-capable attributes with the correct offset,
and fills attributes whose offset is uncacheable with an 'uncacheable'
indicator value; as opposed to -1 which signals "unknown".
This allows users of the API to remove redundant cycles that try to
cache the offset of attributes - instead of O(N-attrs) operations, this
one only requires a O(1) check.
---
src/backend/access/common/tupdesc.c | 111 ++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 113 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 7c5c390503..b3f543cd83 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -927,3 +927,114 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc, returning the last
+ * attcacheoff with a valid offset value.
+ *
+ * Populates attcacheoff with a negative cache value when no offset
+ * can be calculated (due to e.g. variable length attributes).
+ * The negative value is a value relative to the last cacheable attribute
+ * attcacheoff = -1 - (thisattno - cachedattno)
+ * so that the last attribute with cached offset can be found with
+ * cachedattno = attcacheoff + 1 + thisattno
+ *
+ * The value returned is the AttrNumber of the last (1-based) attribute that
+ * had its offset cached.
+ *
+ * When the TupleDesc has 0 attributes, it returns 0.
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber currAttNo, lastCachedAttNo;
+
+ if (numberOfAttributes == 0)
+ return 0;
+
+ /* Non-negative value: this attribute is cached */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff >= 0)
+ return (AttrNumber) desc->natts;
+ /*
+ * Attribute has been filled with relative offset to last cached value, but
+ * it itself is unreachable.
+ */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ return (AttrNumber) (TupleDescAttr(desc, desc->natts - 1)->attcacheoff + 1 + desc->natts);
+
+ /* last attribute of the tupledesc may or may not support attcacheoff */
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ currAttNo = 1;
+ /*
+ * Other code may have populated the value previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff >= 0)
+ currAttNo++;
+
+ /*
+ * Cache offset is undetermined. Start calculating offsets if possible.
+ *
+ * When we exit this block, currAttNo will point at the first uncacheable
+ * attribute, or past the end of the attribute array.
+ */
+ if (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, currAttNo - 1);
+ int32 off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, currAttNo);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ {
+ att->attcacheoff = off;
+ currAttNo++;
+ }
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ currAttNo++;
+ }
+ }
+ }
+
+ Assert(currAttNo == numberOfAttributes || (
+ currAttNo < numberOfAttributes
+ && TupleDescAttr(desc, (currAttNo - 1))->attcacheoff >= 0
+ && TupleDescAttr(desc, currAttNo)->attcacheoff == -1
+ ));
+ /*
+ * No cacheable offsets left. Fill the rest with negative cache values,
+ * but return the latest cached offset.
+ */
+ lastCachedAttNo = currAttNo;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ TupleDescAttr(desc, currAttNo)->attcacheoff = -1 - (currAttNo - lastCachedAttNo);
+ currAttNo++;
+ }
+
+ return lastCachedAttNo;
+}
\ No newline at end of file
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index b4286cf922..2673f2d0f3 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.40.1
v12-0004-Optimize-nbts_attiter-for-nkeyatts-1-btrees.patchapplication/octet-stream; name=v12-0004-Optimize-nbts_attiter-for-nkeyatts-1-btrees.patchDownload
From f483508da0eac4edcf63ff56f05de4f954e104bf Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 20:04:56 +0100
Subject: [PATCH v12 4/6] Optimize nbts_attiter for nkeyatts==1 btrees
This removes the index_getattr_nocache call path, which has significant overhead, and instead uses a constant offset of 0.
---
src/backend/access/nbtree/README | 1 +
src/backend/access/nbtree/nbtree_spec.c | 3 ++
src/include/access/nbtree.h | 35 ++++++++++++++++
src/include/access/nbtree_spec.h | 56 ++++++++++++++++++++++++-
4 files changed, 93 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index e9d0cf6ac1..e90e24cb70 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1104,6 +1104,7 @@ in the index AM to call the specialized functions, increasing the
performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
+ - indexes with only a single key attribute
- multi-column indexes that could benefit from the attcacheoff optimization
NB: This is also the default path, and is comparatively slow for uncachable
attribute offsets.
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 6b766581ab..21635397ed 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_SINGLE_KEYATT:
+ _bt_specialize_single_keyatt(rel);
+ break;
case NBTS_CTX_DEFAULT:
break;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d1bbc4d2a8..72fbf3a4c6 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1122,6 +1122,7 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
typedef enum NBTS_CTX {
+ NBTS_CTX_SINGLE_KEYATT,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
@@ -1131,9 +1132,43 @@ static inline NBTS_CTX _nbt_spec_context(Relation irel)
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
+ if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ return NBTS_CTX_SINGLE_KEYATT;
+
return NBTS_CTX_CACHED;
}
+static inline Datum _bt_getfirstatt(IndexTuple tuple, TupleDesc tupleDesc,
+ bool *isNull)
+{
+ Datum result;
+ if (IndexTupleHasNulls(tuple))
+ {
+ if (att_isnull(0, (bits8 *)(tuple) + sizeof(IndexTupleData)))
+ {
+ *isNull = true;
+ result = (Datum) 0;
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)
+ + sizeof(IndexAttributeBitMapData)));
+ }
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)));
+ }
+
+ return result;
+}
+
#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
#include "nbtree_spec.h"
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index fa38b09c6e..8e476c300d 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -44,6 +44,7 @@
/*
* Macros used in the nbtree specialization code.
*/
+#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -51,8 +52,10 @@
/* contextual specializations */
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
)
@@ -164,6 +167,55 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Specialization 3: SINGLE_KEYATT
+ *
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path cannot be used for indexes with multiple key
+ * columns, because it never considers the next column.
+ */
+
+/* the default context (and later contexts) do need to specialize, so here's that */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel)
+
+#define NBTS_SPECIALIZING_SINGLE_KEYATT
+#define NBTS_TYPE NBTS_TYPE_SINGLE_KEYATT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert((endAttNum) == 1); ((void) (endAttNum)); \
+ if ((initAttNum) == 1) for (int spec_i = 0; spec_i < 1; spec_i++)
+
+#define nbts_attiter_attnum 1
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+( \
+ AssertMacro(spec_i == 0), \
+ _bt_getfirstatt(itup, tupDesc, &NBTS_MAKE_NAME(itup, isNull)) \
+)
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_KEYATT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* All next uses of nbts_prep_ctx are in non-templated code, so here we make
* sure we actually create the context.
--
2.40.1
v12-0002-Specialize-nbtree-functions-on-btree-key-shape.patchapplication/octet-stream; name=v12-0002-Specialize-nbtree-functions-on-btree-key-shape.patchDownload
From bbc222f8ce4d06f16606ab5ea52f5dc420ba3cb1 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:13:04 +0100
Subject: [PATCH v12 2/6] Specialize nbtree functions on btree key shape.
nbtree keys are not all made the same, so a significant amount of time is
spent on code that exists only to deal with other key shapes. By specializing
function calls based on the key shape, we can remove or reduce these causes
of overhead.
This commit adds the basic infrastructure for specializing specific hot code
in the nbtree AM to certain shapes of keys, and splits the code that can
benefit from attribute offset optimizations into separate files. This does
NOT yet update the code itself - it just makes the code compile cleanly.
The performance should be comparable if not the same.
---
contrib/amcheck/verify_nbtree.c | 6 +
src/backend/access/nbtree/README | 28 +
src/backend/access/nbtree/nbtdedup.c | 300 +----
src/backend/access/nbtree/nbtdedup_spec.c | 317 +++++
src/backend/access/nbtree/nbtinsert.c | 579 +--------
src/backend/access/nbtree/nbtinsert_spec.c | 584 +++++++++
src/backend/access/nbtree/nbtpage.c | 1 +
src/backend/access/nbtree/nbtree.c | 37 +-
src/backend/access/nbtree/nbtree_spec.c | 69 +
src/backend/access/nbtree/nbtsearch.c | 1111 +---------------
src/backend/access/nbtree/nbtsearch_spec.c | 1123 +++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 264 +---
src/backend/access/nbtree/nbtsort_spec.c | 280 ++++
src/backend/access/nbtree/nbtsplitloc.c | 3 +
src/backend/access/nbtree/nbtutils.c | 754 +----------
src/backend/access/nbtree/nbtutils_spec.c | 775 ++++++++++++
src/backend/utils/sort/tuplesortvariants.c | 156 +--
.../utils/sort/tuplesortvariants_spec.c | 175 +++
src/include/access/nbtree.h | 44 +-
src/include/access/nbtree_spec.h | 183 +++
src/include/access/nbtree_specfuncs.h | 65 +
src/tools/pginclude/cpluspluscheck | 2 +
src/tools/pginclude/headerscheck | 2 +
23 files changed, 3669 insertions(+), 3189 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.c
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.c
create mode 100644 src/backend/access/nbtree/nbtree_spec.c
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.c
create mode 100644 src/backend/access/nbtree/nbtsort_spec.c
create mode 100644 src/backend/access/nbtree/nbtutils_spec.c
create mode 100644 src/backend/utils/sort/tuplesortvariants_spec.c
create mode 100644 src/include/access/nbtree_spec.h
create mode 100644 src/include/access/nbtree_specfuncs.h
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index e57625b75c..10ed67bffe 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2680,6 +2680,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTStack stack;
Buffer lbuf;
bool exists;
+ nbts_prep_ctx(NULL);
key = _bt_mkscankey(state->rel, itup);
Assert(key->heapkeyspace && key->scantid != NULL);
@@ -2780,6 +2781,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2843,6 +2845,7 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2867,6 +2870,7 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2906,6 +2910,7 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -3141,6 +3146,7 @@ static inline BTScanInsert
bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
+ nbts_prep_ctx(NULL);
skey = _bt_mkscankey(rel, itup);
skey->pivotsearch = true;
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 0f10141a2f..e9d0cf6ac1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1084,6 +1084,34 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+Notes about nbtree specialization
+---------------------------------
+
+Attribute iteration is a significant overhead for multi-column indexes
+with variable length attributes, due to our inability to cache the offset
+of each attribute into an on-disk tuple. To combat this, we'd have to either
+fully deserialize the tuple, or maintain our offset into the tuple as we
+iterate over the tuple's fields.
+
+Keeping track of this offset has a non-negligible overhead too, so we'd
+prefer to not have to keep track of these offsets when we can use the cache.
+By specializing performance-sensitive search functions for these specific
+index tuple shapes and calling those selectively, we can keep the performance
+of cacheable attribute offsets where that is applicable, while improving
+performance where we currently would see O(n_atts^2) time iterating on
+variable-length attributes. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - multi-column indexes that could benefit from the attcacheoff optimization
+ NB: This is also the default path, and is comparatively slow for uncachable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index d4db0b28f2..4589ade267 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,260 +22,14 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
- bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple and add it to our temp page (newpage).
- * Else add pending interval's base tuple to the temp page as-is.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place, when this page's newly allocated right sibling page
- * gets its first deduplication pass).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.c"
+#include "access/nbtree_spec.h"
/*
* Perform bottom-up index deletion pass.
@@ -316,6 +70,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
TM_IndexDeleteOp delstate;
bool neverdedup;
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ nbts_prep_ctx(rel);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -752,55 +507,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
diff --git a/src/backend/access/nbtree/nbtdedup_spec.c b/src/backend/access/nbtree/nbtdedup_spec.c
new file mode 100644
index 0000000000..4b280de980
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.c
@@ -0,0 +1,317 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup_spec.c
+ * Index shape-specialized functions for nbtdedup.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtdedup_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_do_singleval NBTS_FUNCTION(_bt_do_singleval)
+
+static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
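+The #define above is needed because this file is textually included once per
+key shape within a single translation unit: without renaming, each pass would
+redefine the same static symbol. Under the placeholder naming scheme sketched
+earlier (and assuming the header renames the extern entry points the same
+way), one pass would effectively produce something like this, illustration
+only:
+
+    /* possible post-preprocessing result for one inclusion pass */
+    static bool _bt_do_singleval_single(Relation rel, Page page, BTDedupState state,
+                                        OffsetNumber minoff, IndexTuple newitem);
+
+    void
+    _bt_dedup_pass_single(Relation rel, Buffer buf, IndexTuple newitem,
+                          Size newitemsz, bool bottomupdedup)
+    {
+        /* ... same body as below, calling _bt_do_singleval_single() ... */
+    }
+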
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+_bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple and add it to our temp page (newpage).
+ * Else add pending interval's base tuple to the temp page as-is.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place, when this page's newly allocated right sibling page
+ * gets its first deduplication pass).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis of the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+_bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
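
The _bt_keep_natts_fast() calls in _bt_dedup_pass() above are the reason this code lives in a specializable file: that function's per-attribute loop is exactly the hot path the key-shape variants target. For reference, the generic (unspecialized) form of such a loop looks roughly like the sketch below; the function name is made up, and the real code is in nbtutils.c:

    /* Rough sketch of a generic per-attribute equality loop; illustrative only */
    static int
    keep_natts_generic_sketch(Relation rel, IndexTuple lastleft, IndexTuple firstright)
    {
        TupleDesc   itupdesc = RelationGetDescr(rel);
        int         keysz = IndexRelationGetNumberOfKeyAttributes(rel);
        int         keepnatts = 1;

        for (int attnum = 1; attnum <= keysz; attnum++)
        {
            Datum       datum1,
                        datum2;
            bool        isnull1,
                        isnull2;
            Form_pg_attribute att = TupleDescAttr(itupdesc, attnum - 1);

            /*
             * Each index_getattr() call may have to re-walk the tuple from
             * the start when attcacheoff can't be used -- the overhead the
             * specialized iteration macros are meant to avoid.
             */
            datum1 = index_getattr(lastleft, attnum, itupdesc, &isnull1);
            datum2 = index_getattr(firstright, attnum, itupdesc, &isnull2);

            if (isnull1 != isnull2)
                break;

            if (!isnull1 &&
                !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
                break;

            keepnatts++;
        }

        return keepnatts;
    }
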
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 39e7e9b731..3607bd418e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,28 +30,16 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, Relation heaprel,
- BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
static void _bt_stepright(Relation rel, Relation heaprel,
BTInsertState insertstate, BTStack stack);
-static void _bt_insertonpg(Relation rel, Relation heaprel, BTScanInsert itup_key,
- Buffer buf,
- Buffer cbuf,
- BTStack stack,
- IndexTuple itup,
- Size itemsz,
- OffsetNumber newitemoff,
- int postingoff,
+static void _bt_insertonpg(Relation rel, Relation heaprel,
+ BTScanInsert itup_key, Buffer buf, Buffer cbuf,
+ BTStack stack, IndexTuple itup, Size itemsz,
+ OffsetNumber newitemoff, int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key,
Buffer buf, Buffer cbuf, OffsetNumber newitemoff,
@@ -75,313 +63,8 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, heapRel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, heapRel, itup_key, insertstate.buf, InvalidBuffer,
- stack, itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
- AttrNumber cmpcol = 1;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
- &cmpcol) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
-
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, heaprel, insertstate->itup_key, &insertstate->buf,
- BT_WRITE, NULL);
-}
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.c"
+#include "access/nbtree_spec.h"
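
As in nbtdedup.c, the functions that stay in this (unspecialized) file now start with nbts_prep_ctx(rel), so that their calls into the moved code can be routed to the right variant. The real prep/dispatch macros live in nbtree_spec.h; as a mental model only, with made-up names (nbts_ctx, nbts_classify, nbts_call), the pair could look like this:

    /* Hypothetical sketch of a prep-context + dispatch pair; not the real macros */
    typedef enum nbts_ctx
    {
        NBTS_CTX_SINGLE_KEYATT,     /* exactly one key column */
        NBTS_CTX_DEFAULT            /* everything else */
    } nbts_ctx;

    static inline nbts_ctx
    nbts_classify(Relation rel)
    {
        /* made-up classification, just to make the dispatch below concrete */
        return IndexRelationGetNumberOfKeyAttributes(rel) == 1 ?
            NBTS_CTX_SINGLE_KEYATT : NBTS_CTX_DEFAULT;
    }

    /* declares a local holding the index's key shape, once per function */
    #define nbts_prep_ctx(rel) \
        nbts_ctx    nbts_ctx_val = nbts_classify(rel)

    /* routes a call to the specialized copy matching the established shape */
    #define nbts_call(func, rel, ...) \
        ((nbts_ctx_val == NBTS_CTX_SINGLE_KEYATT) ? \
         func##_single((rel), __VA_ARGS__) : \
         func##_default((rel), __VA_ARGS__))

The exact dispatch mechanism matters less than the specialized bodies it selects; treat the above purely as a reading aid.
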
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -425,6 +108,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
bool inposting = false;
bool prevalldead = true;
int curposti = 0;
+ nbts_prep_ctx(rel);
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -776,253 +460,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
- break;
-
- _bt_stepright(rel, heapRel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, heapRel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- {
- AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
- }
-
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1506,6 +943,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
bool newitemonleft,
isleaf,
isrightmost;
+ nbts_prep_ctx(rel);
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -2706,6 +2144,7 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(buffer);
BTPageOpaque opaque = BTPageGetOpaque(page);
+ nbts_prep_ctx(rel);
Assert(P_ISLEAF(opaque));
Assert(simpleonly || itup_key->heapkeyspace);
diff --git a/src/backend/access/nbtree/nbtinsert_spec.c b/src/backend/access/nbtree/nbtinsert_spec.c
new file mode 100644
index 0000000000..6915f22839
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.c
@@ -0,0 +1,584 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtinsert_spec.c
+ * Index shape-specialized functions for nbtinsert.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtinsert_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_search_insert NBTS_FUNCTION(_bt_search_insert)
+#define _bt_findinsertloc NBTS_FUNCTION(_bt_findinsertloc)
+
+static BTStack _bt_search_insert(Relation rel, Relation heaprel,
+ BTInsertState insertstate);
+static OffsetNumber _bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+_bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = _bt_mkscankey(rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = _bt_search_insert(rel, heapRel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, heapRel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+_bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return _bt_search(rel, heaprel, insertstate->itup_key, &insertstate->buf,
+ BT_WRITE, NULL);
+}
+
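+One more thing worth calling out in this hunk: _bt_compare() now takes an
+extra AttrNumber *cmpcol argument (initialized to 1 at every call site above),
+and _bt_binsrch_insert() grows a matching parameter further down. As I read
+it, cmpcol is an in/out value: the caller says which key column to start
+comparing at, and the function reports how far the prefix compared equal,
+which is the hook that later page-level prefix-skipping work can build on. A
+hypothetical caller-side illustration (the helper name and the exact
+return-value semantics are my reading, not code from this patch):
+
+    /*
+     * Hypothetical illustration of reusing an established equal-key prefix
+     * across repeated comparisons against the same high key.
+     */
+    static inline bool
+    key_beyond_highkey(Relation rel, BTScanInsert key, Page page,
+                       AttrNumber *established)
+    {
+        AttrNumber  cmpcol = *established;  /* columns below this compared
+                                             * equal against this page's high
+                                             * key in an earlier call */
+        bool        beyond = _bt_compare(rel, key, page, P_HIKEY, &cmpcol) > 0;
+
+        *established = cmpcol;      /* remember how far we got this time */
+        return beyond;
+    }
+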
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+_bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
+ break;
+
+ _bt_stepright(rel, heapRel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, heapRel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
+
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
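
(Aside, not part of the patch: the "get tired" test above encodes the 0.99 move-right
probability as a uint32 threshold. A minimal standalone sketch of that mapping, using
the same pg_prng API the code already relies on, just to make the arithmetic explicit:)

    #include "postgres.h"
    #include "common/pg_prng.h"

    /*
     * True roughly once per 100 calls: a uniformly random uint32 is
     * <= PG_UINT32_MAX / 100 with probability ~0.01, so the caller keeps
     * moving right with probability ~0.99, matching the comment in
     * _bt_findinsertloc().
     */
    static bool
    bt_get_tired(void)
    {
        return pg_prng_uint32(&pg_global_prng_state) <= PG_UINT32_MAX / 100;
    }
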
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 6558aea42b..7e8e4409c1 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1810,6 +1810,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
bool rightsib_empty;
Page page;
BTPageOpaque opaque;
+ nbts_prep_ctx(rel);
/*
* Save original leafbuf block number from caller. Only deleted blocks
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 62bc9917f1..58f2fdba18 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,8 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.c"
+#include "access/nbtree_spec.h"
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -121,7 +123,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambuild = btbuild;
amroutine->ambuildempty = btbuildempty;
- amroutine->aminsert = btinsert;
+ amroutine->aminsert = btinsert_default;
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
@@ -155,6 +157,8 @@ btbuildempty(Relation index)
Buffer metabuf;
Page metapage;
+ nbt_opt_specialize(index);
+
/*
* Initialize the metapage.
*
@@ -180,33 +184,6 @@ btbuildempty(Relation index)
ReleaseBuffer(metabuf);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
@@ -348,6 +325,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -791,6 +770,8 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(rel);
+
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
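
(The nbt_opt_specialize() calls added to the AM entry points above are what arm the
specialization for a given index relation. The macro itself lives in access/nbtree_spec.h,
which is not part of this excerpt; a rough sketch of the idea follows, with the exact
shape assumed rather than copied from the patch -- only btinsert_default and
_bt_specialize are names that actually appear here:)

    /*
     * Assumed shape of nbt_opt_specialize(): if the relation still points at
     * the unspecialized aminsert (btinsert_default), run _bt_specialize()
     * once to install the function pointers for this index's key shape.
     */
    #define nbt_opt_specialize(rel) \
        do { \
            if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
                _bt_specialize(rel); \
        } while (0)
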
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
new file mode 100644
index 0000000000..6b766581ab
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.c
+ * Index shape-specialized functions for nbtree.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtree_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
+void
+_bt_specialize(Relation rel)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ NBTS_MAKE_CTX(rel);
+ /*
+ * We can't name _bt_specialize directly here because the name would be
+ * macro-expanded, nor can we use NBTS_SPECIALIZE_NAME here, because that
+ * would call back into _bt_specialize and recurse forever.
+ */
+ switch (__nbts_ctx) {
+ case NBTS_CTX_CACHED:
+ _bt_specialize_cached(rel);
+ break;
+ case NBTS_CTX_DEFAULT:
+ break;
+ }
+#else
+ rel->rd_indam->aminsert = btinsert;
+#endif
+}
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+ bool result;
+ IndexTuple itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+}
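
(The NBTS_CTX_* switch in _bt_specialize() above is the runtime half of the scheme; the
compile-time half is the NBTS_FUNCTION() name mangling used in the *_spec.c files. A
hypothetical sketch of how such mangling could be set up per specialization pass -- the
_cached/_default suffixes follow the convention visible above, but the macro bodies are
assumptions, not an excerpt from nbtree_spec.h:)

    /* one token-pasting pass per key shape; the real header may differ */
    #define NBTS_MAKE_NAME_(name, ctx)  name##_##ctx
    #define NBTS_MAKE_NAME(name, ctx)   NBTS_MAKE_NAME_(name, ctx)

    #if defined(NBTS_SPECIALIZING_CACHED)
    #define NBTS_FUNCTION(name)  NBTS_MAKE_NAME(name, cached)    /* e.g. _bt_binsrch_cached */
    #elif defined(NBTS_SPECIALIZING_DEFAULT)
    #define NBTS_FUNCTION(name)  NBTS_MAKE_NAME(name, default)   /* e.g. _bt_binsrch_default */
    #endif
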
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index a6998e48d8..d31bb8abdf 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,12 +26,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
- AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -48,6 +44,8 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.c"
+#include "access/nbtree_spec.h"
/*
* _bt_drop_lock_and_maybe_pin()
@@ -72,601 +70,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- *
- * heaprel must be provided by callers that pass access = BT_WRITE, since we
- * might need to allocate a new root page for caller -- see _bt_allocbuf.
- */
-BTStack
-_bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
- int access, Snapshot snapshot)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
- char tupdatabuf[BLCKSZ / 3];
- AttrNumber highkeycmpcol = 1;
-
- /* heaprel must be set whenever _bt_allocbuf is reachable */
- Assert(access == BT_READ || access == BT_WRITE);
- Assert(access == BT_READ || heaprel != NULL);
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, heaprel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
- stack_in, page_access, snapshot, &highkeycmpcol,
- (char *) tupdatabuf);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
- memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- highkeycmpcol = 1;
-
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot, &highkeycmpcol, (char *) tupdatabuf);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split is
- * completed. 'heaprel' and 'stack' are only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- *
- * If the snapshot parameter is not NULL, "old snapshot" checking will take
- * place during the descent through the tree. This is not needed when
- * positioning for an insert or delete, so NULL is used for those cases.
- */
-Buffer
-_bt_moveright(Relation rel,
- Relation heaprel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- Snapshot snapshot,
- AttrNumber *comparecol,
- char *tupdatabuf)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- Assert(!forupdate || heaprel != NULL);
- Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- page = BufferGetPage(buf);
- TestForOldSnapshot(snapshot, rel, page);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- {
- *comparecol = 1;
- break;
- }
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, heaprel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- /*
- * tupdatabuf is filled with the right separator of the parent node.
- * This allows us to do a binary equality check between the parent
- * node's right separator (which is < key) and this page's P_HIKEY.
- * If they are equal, we can reuse the result of the parent node's
- * right-key compare, which means we can potentially save a full key
- * compare (which includes indirect calls to attribute comparison
- * functions).
- *
- * Without this, we'd use on average 3 full key compares per page before
- * we achieve full dynamic prefix bounds, but with this optimization
- * that is only 2.
- *
- * 3 compares: 1 for the high key (rightmost), plus on average 2 before
- * we move right in the binary search on the page; this average equals
- * SUM (1/2 ^ x) for x from 0 to log(n items), which tends to 2.
- */
- if (!P_IGNORE(opaque) && *comparecol > 1)
- {
- IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
- IndexTuple buftuple = (IndexTuple) tupdatabuf;
- if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
- {
- char *dataptr = (char *) itup;
-
- if (memcmp(dataptr + sizeof(IndexTupleData),
- tupdatabuf + sizeof(IndexTupleData),
- IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
- break;
- } else {
- *comparecol = 1;
- }
- } else {
- *comparecol = 1;
- }
-
- if (P_IGNORE(opaque) ||
- _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
- {
- *comparecol = 1;
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- {
- *comparecol = cmpcol;
- break;
- }
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * When called, the "highkeycmpcol" pointer argument is expected to contain the
- * AttrNumber of the first attribute that is not shared between scan key and
- * this page's high key, i.e. the first attribute that we have to compare
- * against the scan key. The value will be updated by _bt_binsrch to contain
- * this same first column we'll need to compare against the scan key, but now
- * for the index tuple at the returned offset. Valid values range from 1
- * (no shared prefix) to the number of key attributes + 1 (all index key
- * attributes are equal to the scan key). See also _bt_compare, and
- * backend/access/nbtree/README for more info.
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf,
- AttrNumber *highkeycmpcol)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
- /*
- * Prefix bounds, for the high/low offset's compare columns.
- * "highkeycmpcol" is the value for this page's high key (if any) or 1
- * (no established shared prefix)
- */
- AttrNumber highcmpcol = *highkeycmpcol,
- lowcmpcol = 1;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We maintain highcmpcol and lowcmpcol to keep track of prefixes that
- * tuples share with the scan key, potentially allowing us to skip a
- * prefix in the midpoint comparison.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol); /* update prefix bounds */
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
- }
- }
-
- /* update the bounds at the caller */
- *highkeycmpcol = highcmpcol;
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
- AttrNumber lowcmpcol = 1;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
-
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
-
/*----------
* _bt_binsrch_posting() -- posting list binary search.
*
@@ -734,235 +137,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * NOTE: The "comparecol" argument must refer to the first attribute of the
- * index tuple of which the caller knows that it does not match the scan key:
- * this means 1 for "no known matching attributes", up to the number of key
- * attributes + 1 if the caller knows that all key attributes of the index
- * tuple match those of the scan key. See backend/access/nbtree/README for
- * details.
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum,
- AttrNumber *comparecol)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
-
- scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- {
- *comparecol = i;
- return result;
- }
-
- scankey++;
- }
-
- /*
- * All tuple attributes are equal to the scan key, only later attributes
- * could potentially not equal the scan key.
- */
- *comparecol = ntupatts + 1;
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
-
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -1004,6 +178,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
BTScanPosItem *currItem;
BlockNumber blkno;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(rel);
Assert(!BTScanPosIsValid(so->currPos));
@@ -1638,280 +813,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2120,12 +1021,11 @@ static bool
_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Relation rel;
+ Relation rel = scan->indexRelation;
Page page;
BTPageOpaque opaque;
bool status;
-
- rel = scan->indexRelation;
+ nbts_prep_ctx(rel);
if (ScanDirectionIsForward(dir))
{
@@ -2537,6 +1437,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
BTPageOpaque opaque;
OffsetNumber start;
BTScanPosItem *currItem;
+ nbts_prep_ctx(rel);
/*
* Scan down to the leftmost or rightmost leaf page. This is a simplified
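
(As with nbtree.c earlier, nbtsearch.c now defines NBT_SPECIALIZE_FILE and includes
access/nbtree_spec.h, which is expected to pull the *_spec.c template in once per
key-shape context. Sketched in outline below -- this is an assumption about the header's
mechanism, not an excerpt from it; note that #include of a macro expanding to a string
literal is standard C preprocessor behaviour:)

    /* in access/nbtree_spec.h, after NBT_SPECIALIZE_FILE has been defined */
    #ifdef NBT_SPECIALIZE_FILE

    #define NBTS_SPECIALIZING_CACHED
    #include NBT_SPECIALIZE_FILE        /* emits the attcacheoff-based variants */
    #undef NBTS_SPECIALIZING_CACHED

    #define NBTS_SPECIALIZING_DEFAULT
    #include NBT_SPECIALIZE_FILE        /* emits the generic variants */
    #undef NBTS_SPECIALIZING_DEFAULT

    #endif                              /* NBT_SPECIALIZE_FILE */
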
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
new file mode 100644
index 0000000000..5f1ead2400
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -0,0 +1,1123 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsearch_spec.c
+ * Index shape-specialized functions for nbtsearch.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsearch_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_binsrch NBTS_FUNCTION(_bt_binsrch)
+#define _bt_readpage NBTS_FUNCTION(_bt_readpage)
+
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
+static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ *
+ * heaprel must be provided by callers that pass access = BT_WRITE, since we
+ * might need to allocate a new root page for caller -- see _bt_allocbuf.
+ */
+BTStack
+_bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
+ int access, Snapshot snapshot)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
+
+ /* heaprel must be set whenever _bt_allocbuf is reachable */
+ Assert(access == BT_READ || access == BT_WRITE);
+ Assert(access == BT_READ || heaprel != NULL);
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, heaprel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
+ stack_in, page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is lowest non-leaf page level prior to leaves. So, if
+ * we're on the level 1 and asked to lock leaf page in write mode,
+ * then lock next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ highkeycmpcol = 1;
+
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+ snapshot, &highkeycmpcol, (char *) tupdatabuf);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split is
+ * completed. 'heaprel' and 'stack' are only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ *
+ * If the snapshot parameter is not NULL, "old snapshot" checking will take
+ * place during the descent through the tree. This is not needed when
+ * positioning for an insert or delete, so NULL is used for those cases.
+ */
+Buffer
+_bt_moveright(Relation rel,
+ Relation heaprel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ Assert(!forupdate || heaprel != NULL);
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ page = BufferGetPage(buf);
+ TestForOldSnapshot(snapshot, rel, page);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
+ break;
+ }
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, heaprel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ /*
+		 * tupdatabuf is filled with the right separator of the parent node.
+		 * This allows us to do a binary equality check between the parent
+		 * node's right separator (which is < key) and this page's P_HIKEY.
+		 * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+		 * Without this, we would on average need 3 full key compares per
+		 * page before we achieve full dynamic prefix bounds, but with this
+		 * optimization that is only 2.
+		 *
+		 * The 3 compares are: 1 for the high key (rightmost), plus on
+		 * average 2 before we move right in the binary search on the page;
+		 * that average equals SUM((1/2)^x) for x from 0 to log(n items),
+		 * which tends to 2.
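+		 *
+		 * (Roughly: each midpoint compare is about equally likely to land
+		 * on either side of the scan key, so the expected number of
+		 * compares up to and including the first one that moves the lower
+		 * bound up is SUM((1/2)^x) for x >= 0 = 1 / (1 - 1/2) = 2.)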
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+			}
+			else
+			{
+				*comparecol = 1;
+			}
+		}
+		else
+		{
+			*comparecol = 1;
+		}
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
+ {
+ *comparecol = 1;
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ {
+ *comparecol = cmpcol;
+ break;
+ }
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
+
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * When called, the "highkeycmpcol" pointer argument is expected to contain the
+ * AttrNumber of the first attribute that is not shared between scan key and
+ * this page's high key, i.e. the first attribute that we have to compare
+ * against the scan key. The value will be updated by _bt_binsrch to contain
+ * this same first column we'll need to compare against the scan key, but now
+ * for the index tuple at the returned offset. Valid values range from 1
+ * (no shared prefix) to the number of key attributes + 1 (all index key
+ * attributes are equal to the scan key). See also _bt_compare, and
+ * backend/access/nbtree/README for more info.
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
+ */
+static OffsetNumber
+_bt_binsrch(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+ /*
+	 * Prefix bounds for the high/low offsets' compare columns.
+	 * "highkeycmpcol" is the value for this page's high key (if any), or 1
+	 * (no established shared prefix).
+ */
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+	 * We maintain highcmpcol and lowcmpcol to keep track of the prefixes
+	 * that the current high and low bound tuples share with the scan key,
+	 * potentially allowing us to skip that prefix in the midpoint
+	 * comparison: because the items on the page are ordered, any tuple
+	 * between the two bounds must also share the smaller of those two
+	 * prefixes with the scan key.
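+	 *
+	 * For example, with a two-column index and a scan key of ('foo', 42):
+	 * if the current low bound tuple compared as ('foo', 10), lowcmpcol is
+	 * 2, and if the current high bound tuple compared as ('foo', 97),
+	 * highcmpcol is also 2, so the next midpoint compare may skip the
+	 * first attribute. A bound that compared as ('goo', 1) would reset its
+	 * side to 1, and the midpoint compare would again start at the first
+	 * attribute.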
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol); /* update prefix bounds */
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+ }
+ }
+
+ /* update the bounds at the caller */
+ *highkeycmpcol = highcmpcol;
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
+
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+ AttrNumber lowcmpcol = 1;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
+
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * NOTE: The "comparecol" argument must refer to the first attribute of the
+ * index tuple that is not already known to match the scan key: this means 1
+ * for "no known matching attributes", up to the number of key attributes + 1
+ * if the caller knows that all key attributes of the index tuple match those
+ * of the scan key. See backend/access/nbtree/README for details.
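+ *
+ * For example, a caller passing *comparecol = 3 asserts that the first two
+ * key attributes of this index tuple are already known to equal the scan
+ * key, so the comparison loop below starts at the third attribute. When the
+ * loop finds an unequal attribute it reports that attribute number back
+ * through *comparecol; when all compared key attributes are equal it sets
+ * *comparecol to the tuple's attribute count + 1.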
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+_bt_compare(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ {
+ *comparecol = i;
+ return result;
+ }
+
+ scankey++;
+ }
+
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
+
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+	/* allow next page to be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later. This allows us to
+	 * drop the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c2665fce41..8742716383 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,8 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.c"
+#include "access/nbtree_spec.h"
/*
* btbuild() -- build a new btree index.
@@ -544,6 +544,7 @@ static void
_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
{
BTWriteState wstate;
+ nbts_prep_ctx(btspool->index);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
@@ -846,6 +847,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
Size pgspc;
Size itupsz;
bool isleaf;
+ nbts_prep_ctx(wstate->index);
/*
* This is a handy place to check for cancel interrupts during the btree
@@ -1178,264 +1180,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- Assert(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
new file mode 100644
index 0000000000..368d6f244c
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsort_spec.c
+ * Index shape-specialized functions for nbtsort.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsort_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_load NBTS_FUNCTION(_bt_load)
+
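+/*
+ * NBTS_FUNCTION() is assumed to be provided by access/nbtree_spec.h: it
+ * pastes a key-shape-specific suffix onto the bare name, so that including
+ * this file once per specialized index key shape emits a separate _bt_load
+ * variant for each shape. A minimal sketch of that idea (hypothetical
+ * names; the real definitions live in nbtree_spec.h, not in this file):
+ *
+ *	#define NBTS_MAKE_NAME2(a, b)	a##_##b
+ *	#define NBTS_MAKE_NAME(a, b)	NBTS_MAKE_NAME2(a, b)
+ *	#define NBTS_FUNCTION(name)	NBTS_MAKE_NAME(name, NBTS_SPECIALIZING)
+ */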
+static void _bt_load(BTWriteState *wstate,
+ BTSpool *btspool, BTSpool *btspool2);
+
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ Assert(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 43b67893d9..db2da1e303 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -639,6 +639,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
ItemId itemid;
IndexTuple tup;
int keepnatts;
+ nbts_prep_ctx(state->rel);
Assert(state->is_leaf && !state->is_rightmost);
@@ -945,6 +946,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
*rightinterval;
int perfectpenalty;
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+ nbts_prep_ctx(state->rel);
/* Assume that alternative strategy won't be used for now */
*strategy = SPLIT_DEFAULT;
@@ -1137,6 +1139,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
{
IndexTuple lastleft;
IndexTuple firstright;
+ nbts_prep_ctx(state->rel);
if (!state->is_leaf)
{
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7da499c4dd..37d644e9f3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,10 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.c"
+#include "access/nbtree_spec.h"
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1220,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2173,286 +1703,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
new file mode 100644
index 0000000000..0288da22d6
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -0,0 +1,775 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtutils_spec.c
+ * Index shape-specialized functions for nbtutils.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtutils_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_check_rowcompare NBTS_FUNCTION(_bt_check_rowcompare)
+#define _bt_keep_natts NBTS_FUNCTION(_bt_keep_natts)
+
+static bool _bt_check_rowcompare(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+ continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+ TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 84442a93c5..d93839620d 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -61,10 +61,6 @@ static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
SortTuple *stup);
static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
LogicalTape *tape, unsigned int tuplen);
-static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
-static int comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static int comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -140,6 +136,9 @@ typedef struct
int datumTypeLen;
} TuplesortDatumArg;
+#define NBT_SPECIALIZE_FILE "../../backend/utils/sort/tuplesortvariants_spec.c"
+#include "access/nbtree_spec.h"
+
Tuplesortstate *
tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
@@ -228,6 +227,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
MemoryContext oldcontext;
TuplesortClusterArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
@@ -340,6 +340,7 @@ tuplesort_begin_index_btree(Relation heapRel,
TuplesortIndexBTreeArg *arg;
MemoryContext oldcontext;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -475,6 +476,7 @@ tuplesort_begin_index_gist(Relation heapRel,
MemoryContext oldcontext;
TuplesortIndexBTreeArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -1299,152 +1301,6 @@ removeabbrev_index(Tuplesortstate *state, SortTuple *stups, int count)
}
}
-static int
-comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- /*
- * This is similar to comparetup_heap(), but expects index tuples. There
- * is also special handling for enforcing uniqueness, and special
- * treatment for equal keys at the end.
- */
- TuplesortPublic *base = TuplesortstateGetPublic(state);
- SortSupport sortKey = base->sortKeys;
- int32 compare;
-
- /* Compare the leading sort key */
- compare = ApplySortComparator(a->datum1, a->isnull1,
- b->datum1, b->isnull1,
- sortKey);
- if (compare != 0)
- return compare;
-
- /* Compare additional sort keys */
- return comparetup_index_btree_tiebreak(a, b, state);
-}
-
-static int
-comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- TuplesortPublic *base = TuplesortstateGetPublic(state);
- TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
- SortSupport sortKey = base->sortKeys;
- IndexTuple tuple1;
- IndexTuple tuple2;
- int keysz;
- TupleDesc tupDes;
- bool equal_hasnull = false;
- int nkey;
- int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
- tuple1 = (IndexTuple) a->tuple;
- tuple2 = (IndexTuple) b->tuple;
- keysz = base->nKeys;
- tupDes = RelationGetDescr(arg->index.indexRel);
-
- if (sortKey->abbrev_converter)
- {
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
- }
-
- /* they are equal, so we only need to examine one null flag */
- if (a->isnull1)
- equal_hasnull = true;
-
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
- {
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
-
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare; /* done when we find unequal attributes */
-
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
- equal_hasnull = true;
- }
-
- /*
- * If btree has asked us to enforce uniqueness, complain if two equal
- * tuples are detected (unless there was at least one NULL field and NULLS
- * NOT DISTINCT was not set).
- *
- * It is sufficient to make the test here, because if two tuples are equal
- * they *must* get compared at some stage of the sort --- otherwise the
- * sort algorithm wouldn't have checked whether one must appear before the
- * other.
- */
- if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- /*
- * Some rather brain-dead implementations of qsort (such as the one in
- * QNX 4) will sometimes call the comparison routine to compare a
- * value to itself, but we always use our own implementation, which
- * does not.
- */
- Assert(tuple1 != tuple2);
-
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is required for
- * btree indexes, since heap TID is treated as an implicit last key
- * attribute in order to ensure that all keys in the index are physically
- * unique.
- */
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
- /* ItemPointer values should never be equal */
- Assert(false);
-
- return 0;
-}
-
static int
comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state)
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
new file mode 100644
index 0000000000..705da09329
--- /dev/null
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -0,0 +1,175 @@
+/*-------------------------------------------------------------------------
+ *
+ * tuplesortvariants_spec.c
+ * Index shape-specialized functions for tuplesortvariants.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/sort/tuplesortvariants_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define comparetup_index_btree NBTS_FUNCTION(comparetup_index_btree)
+#define comparetup_index_btree_tiebreak NBTS_FUNCTION(comparetup_index_btree_tiebreak)
+
+static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+static int comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+
+static int
+comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ /*
+ * This is similar to comparetup_heap(), but expects index tuples. There
+ * is also special handling for enforcing uniqueness, and special
+ * treatment for equal keys at the end.
+ */
+ TuplesortPublic *base = TuplesortstateGetPublic(state);
+ SortSupport sortKey = base->sortKeys;
+ int32 compare;
+
+ /* Compare the leading sort key */
+ compare = ApplySortComparator(a->datum1, a->isnull1,
+ b->datum1, b->isnull1,
+ sortKey);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ return comparetup_index_btree_tiebreak(a, b, state);
+}
+
+static int
+comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ TuplesortPublic *base = TuplesortstateGetPublic(state);
+ TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
+ SortSupport sortKey = base->sortKeys;
+ IndexTuple tuple1;
+ IndexTuple tuple2;
+ int keysz;
+ TupleDesc tupDes;
+ bool equal_hasnull = false;
+ int nkey;
+ int32 compare;
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+ tuple1 = (IndexTuple) a->tuple;
+ tuple2 = (IndexTuple) b->tuple;
+ keysz = base->nKeys;
+ tupDes = RelationGetDescr(arg->index.indexRel);
+
+ if (sortKey->abbrev_converter)
+ {
+ datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
+
+ compare = ApplySortAbbrevFullComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare;
+ }
+
+ /* they are equal, so we only need to examine one null flag */
+ if (a->isnull1)
+ equal_hasnull = true;
+
+ sortKey++;
+ for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ {
+ datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+
+ compare = ApplySortComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare; /* done when we find unequal attributes */
+
+ /* they are equal, so we only need to examine one null flag */
+ if (isnull1)
+ equal_hasnull = true;
+ }
+
+ /*
+ * If btree has asked us to enforce uniqueness, complain if two equal
+ * tuples are detected (unless there was at least one NULL field and NULLS
+ * NOT DISTINCT was not set).
+ *
+ * It is sufficient to make the test here, because if two tuples are equal
+ * they *must* get compared at some stage of the sort --- otherwise the
+ * sort algorithm wouldn't have checked whether one must appear before the
+ * other.
+ */
+ if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ /*
+ * Some rather brain-dead implementations of qsort (such as the one in
+ * QNX 4) will sometimes call the comparison routine to compare a
+ * value to itself, but we always use our own implementation, which
+ * does not.
+ */
+ Assert(tuple1 != tuple2);
+
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is required for
+ * btree indexes, since heap TID is treated as an implicit last key
+ * attribute in order to ensure that all keys in the index are physically
+ * unique.
+ */
+ {
+ BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
+ BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
+
+ if (blk1 != blk2)
+ return (blk1 < blk2) ? -1 : 1;
+ }
+ {
+ OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
+ OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
+
+ if (pos1 != pos2)
+ return (pos1 < pos2) ? -1 : 1;
+ }
+
+ /* ItemPointer values should never be equal */
+ Assert(false);
+
+ return 0;
+}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 11f4184107..d1bbc4d2a8 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1121,15 +1121,27 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+typedef enum NBTS_CTX
+{
+	NBTS_CTX_CACHED,
+	NBTS_CTX_DEFAULT,			/* fallback */
+} NBTS_CTX;
+
+static inline NBTS_CTX
+_nbt_spec_context(Relation irel)
+{
+	if (!PointerIsValid(irel))
+		return NBTS_CTX_DEFAULT;
+
+	return NBTS_CTX_CACHED;
+}
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
+#include "nbtree_spec.h"
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1160,8 +1172,6 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1177,9 +1187,6 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
extern void _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf,
BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack,
@@ -1230,16 +1237,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
- Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
- Buffer buf, bool forupdate, BTStack stack,
- int access, Snapshot snapshot,
- AttrNumber *comparecol, char *tupdatabuf);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
- OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -1248,7 +1245,6 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1256,8 +1252,6 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1270,10 +1264,6 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
new file mode 100644
index 0000000000..fa38b09c6e
--- /dev/null
+++ b/src/include/access/nbtree_spec.h
@@ -0,0 +1,183 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.h
+ *	  specialization templates for the postgres btree access method.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_spec.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ * - nbts_prep_ctx(irel)
+ *   Constructs a context that is used to call specialized functions.
+ *   Note that this is not needed in code that is itself included through
+ *   nbtree_spec.h (i.e. specialized code), because such code always calls
+ *   the specialized functions directly.
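+ *
+ * Purely as an illustration (this sketch is not part of the patch, and
+ * "example_keep_natts" is a hypothetical name), a _bt_keep_natts_fast-style
+ * comparison loop rewritten against these macros might look like this:
+ *
+ *		static int
+ *		example_keep_natts(Relation rel, IndexTuple lastleft,
+ *						   IndexTuple firstright)
+ *		{
+ *			TupleDesc	itupdesc = RelationGetDescr(rel);
+ *			int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ *			int			keepnatts = 1;
+ *			nbts_attiterdeclare(lastleft);
+ *			nbts_attiterdeclare(firstright);
+ *
+ *			nbts_attiterinit(lastleft, 1, itupdesc);
+ *			nbts_attiterinit(firstright, 1, itupdesc);
+ *
+ *			nbts_foreachattr(1, nkeyatts)
+ *			{
+ *				Datum	datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ *				Datum	datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ *				Form_pg_attribute att = TupleDescAttr(itupdesc,
+ *													  nbts_attiter_attnum - 1);
+ *
+ *				if (nbts_attiter_curattisnull(lastleft) !=
+ *					nbts_attiter_curattisnull(firstright))
+ *					break;
+ *				if (!nbts_attiter_curattisnull(lastleft) &&
+ *					!datum_image_eq(datum1, datum2,
+ *									att->attbyval, att->attlen))
+ *					break;
+ *				keepnatts++;
+ *			}
+ *			return keepnatts;
+ *		}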
+ */
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+#define NBTS_CTX_NAME __nbts_ctx
+
+/* contextual specializations */
+#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
+#define NBTS_SPECIALIZE_NAME(name) ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
+)
+
+/* how do we make names? */
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
+
+#define nbt_opt_specialize(rel) \
+do { \
+ Assert(PointerIsValid(rel)); \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
+} while (false)
+
+/*
+ * Protections against multiple inclusions - the definition of this macro is
+ * different for files included with the templating mechanism vs the users
+ * of this template, so redefine these macros at top and bottom.
+ */
+#ifdef NBTS_FUNCTION
+#undef NBTS_FUNCTION
+#endif
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+/* While specializing, the context is the local context */
+#ifdef nbts_prep_ctx
+#undef nbts_prep_ctx
+#endif
+#define nbts_prep_ctx(rel)
+
+/*
+ * Specialization 1: CACHED
+ *
+ * Multiple key columns, optimized access for attcacheoff-cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) do {} while (false)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_SPECIALIZING_CACHED
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * Specialization 2: DEFAULT
+ *
+ * "Default", externally accessible, not so optimized functions
+ */
+
+/* Only the default specialization may still need to dispatch to specialized code, so set up the context here */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * All next uses of nbts_prep_ctx are in non-templated code, so here we make
+ * sure we actually create the context.
+ */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
+
+/*
+ * From here on, all uses of NBTS_FUNCTION refer to specialized function
+ * names that are being called. Change the result of that macro from a
+ * direct call into a conditional call to the right specialization,
+ * depending on the current context.
+ */
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) NBTS_SPECIALIZE_NAME(name)
+
+#undef NBT_SPECIALIZE_FILE
diff --git a/src/include/access/nbtree_specfuncs.h b/src/include/access/nbtree_specfuncs.h
new file mode 100644
index 0000000000..b87f5bf802
--- /dev/null
+++ b/src/include/access/nbtree_specfuncs.h
@@ -0,0 +1,65 @@
+/*
+ * prototypes for functions that are included in nbtree.h
+ */
+
+#define _bt_specialize NBTS_FUNCTION(_bt_specialize)
+#define btinsert NBTS_FUNCTION(btinsert)
+#define _bt_dedup_pass NBTS_FUNCTION(_bt_dedup_pass)
+#define _bt_doinsert NBTS_FUNCTION(_bt_doinsert)
+#define _bt_search NBTS_FUNCTION(_bt_search)
+#define _bt_moveright NBTS_FUNCTION(_bt_moveright)
+#define _bt_binsrch_insert NBTS_FUNCTION(_bt_binsrch_insert)
+#define _bt_compare NBTS_FUNCTION(_bt_compare)
+#define _bt_mkscankey NBTS_FUNCTION(_bt_mkscankey)
+#define _bt_checkkeys NBTS_FUNCTION(_bt_checkkeys)
+#define _bt_truncate NBTS_FUNCTION(_bt_truncate)
+#define _bt_keep_natts_fast NBTS_FUNCTION(_bt_keep_natts_fast)
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void _bt_specialize(Relation rel);
+
+extern bool btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem,
+ Size newitemsz, bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool _bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
+ Buffer *bufP, int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+ Buffer buf, bool forupdate, BTStack stack,
+ int access, Snapshot snapshot,
+ AttrNumber *comparecol, char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
+extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index 4e09c4686b..e504a2f114 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -116,6 +116,8 @@ do
test "$f" = src/pl/tcl/pltclerrcodes.h && continue
# Also not meant to be included standalone.
+ test "$f" = src/include/access/nbtree_spec.h && continue
+ test "$f" = src/include/access/nbtree_specfuncs.h && continue
test "$f" = src/include/common/unicode_nonspacing_table.h && continue
test "$f" = src/include/common/unicode_east_asian_fw_table.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 8dee1b5670..101888c806 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -111,6 +111,8 @@ do
test "$f" = src/pl/tcl/pltclerrcodes.h && continue
# Also not meant to be included standalone.
+ test "$f" = src/include/access/nbtree_spec.h && continue
+ test "$f" = src/include/access/nbtree_specfuncs.h && continue
test "$f" = src/include/common/unicode_nonspacing_table.h && continue
test "$f" = src/include/common/unicode_east_asian_fw_table.h && continue
--
2.40.1
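As background for reviewers unfamiliar with this style of macro templating:
the following is a minimal, self-contained C sketch of the mechanism that
nbtree_spec.h uses, i.e. instantiating one copy of a function body per key
shape by token-pasting a suffix onto the function name, and dispatching
through a context macro. All names below are made up for illustration; this
is not code from the patch.

    #include <stdio.h>

    #define SPEC_MAKE_NAME_(a,b)  a##_##b
    #define SPEC_MAKE_NAME(a,b)   SPEC_MAKE_NAME_(a,b)

    typedef enum { CTX_CACHED, CTX_DEFAULT } SpecCtx;

    /* "Template" body, instantiated once per specialization suffix */
    #define SPEC_TYPE cached
    static int
    SPEC_MAKE_NAME(compare, SPEC_TYPE)(int a, int b)
    {
        /* the real cached variant would use attcacheoff-based offsets here */
        return (a > b) - (a < b);
    }
    #undef SPEC_TYPE

    #define SPEC_TYPE default
    static int
    SPEC_MAKE_NAME(compare, SPEC_TYPE)(int a, int b)
    {
        /* the real default variant would use plain attribute lookups here */
        return (a > b) - (a < b);
    }
    #undef SPEC_TYPE

    /* Dispatcher, analogous in spirit to NBTS_SPECIALIZE_NAME() */
    #define spec_compare(ctx, a, b) \
        ((ctx) == CTX_CACHED ? compare_cached((a), (b)) : compare_default((a), (b)))

    int
    main(void)
    {
        SpecCtx ctx = CTX_CACHED;  /* in the patch, derived from the Relation */

        printf("%d\n", spec_compare(ctx, 3, 5));
        return 0;
    }

In the patch the same effect is reached by #including the specializable .c
files once per NBTS_TYPE, but the naming and dispatch idea is the same.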
v12-0001-Implement-dynamic-prefix-compression-in-nbtree.patch
From 36800c5124b34e5ba105901de5ba9a0ed9c18d4b Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 10 Jan 2023 21:45:44 +0100
Subject: [PATCH v12 1/6] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, if some prefix of the
scan attributes on both sides of the compared tuple are equal
to the scankey, then the current tuple that is being compared
must also have those prefixing attributes that equal the
scankey.
We cannot generally propagate this information to _binsrch on
lower pages, as this downstream page may have concurrently split
and/or have merged with its deleted left neighbour (see [0]),
which moves the keyspace of the linked page. We thus can only
trust the current state of this current page for this optimization,
which means we must validate this state each time we open the page.
Although this limits the overall applicability of the
performance improvement, it still allows for a nice performance
improvement in most cases where initial columns have many
duplicate values and a compare function that is not cheap.
As an exception to the above rule, most of the time a page's
highkey is equal to the right separator on the parent page due to
how btree splits are done. By storing this right separator from
the parent page and then validating that the highkey of the child
page contains the exact same data, we can restore the right prefix
bound without having to call the relatively expensive _bt_compare.
In the worst-case scenario of a concurrent page split, we'd still
have to validate the full key, but that doesn't happen very often
when compared to the number of times we descend the btree.
---
contrib/amcheck/verify_nbtree.c | 17 +--
src/backend/access/nbtree/README | 43 ++++++++
src/backend/access/nbtree/nbtinsert.c | 34 ++++--
src/backend/access/nbtree/nbtsearch.c | 145 +++++++++++++++++++++++---
src/include/access/nbtree.h | 9 +-
5 files changed, 214 insertions(+), 34 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 94a9759322..e57625b75c 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2701,6 +2701,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2710,13 +2711,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2778,6 +2779,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2788,7 +2790,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2840,10 +2842,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2863,10 +2866,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2901,13 +2905,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..0f10141a2f 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,49 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because NBTrees have a sorted keyspace, when we have determined that some
+prefixing columns of tuples on both sides of the tuple that is being
+compared are equal to the scankey, then the current tuple must also share
+this prefix with the scankey. This allows us to skip comparing those columns,
+saving the indirect function calls in the compare operation.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so this is only useful on the page level: Concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by a parent page to the right. If we re-used high- and low-column-prefixes,
+we would not be able to detect a change of keyspace from e.g. [2,3) to [1,2),
+and subsequently return invalid results. This race condition can only be
+prevented by re-establishing the prefix-equal-columns for each page.
+
+There is positive news, though: A page split will put a binary copy of the
+page's highkey in the parent page. This means that we usually can reuse
+the compare result of the parent page's downlink's right sibling when we
+discover that their representation is binary equal. In general this will
+be the case, as only in concurrent page splits and deletes the downlink
+may not point to the page with the correct highkey bound (_bt_moveright
+only rarely actually moves right).
+
+To implement this, we copy the downlink's right differentiator key into a
+temporary buffer, which is then compared against the child pages' highkey.
+If they match, we reuse the compare result (plus prefix) we had for it from
+the parent page; if not, we need to do a full _bt_compare. Because memcpy +
+memcmp is cheap compared to _bt_compare, and because it's quite unlikely
+that we guess wrong, this speeds up our _bt_moveright code (at the cost of
+some stack memory in _bt_search and some overhead in case of a wrong prediction).
+
+Now that we have prefix bounds on the highest value of a page, the
+_bt_binsrch procedure will use this result as a rightmost prefix compare,
+and for each step in the binary search (that does not compare less than the
+insert key) improve the equal-prefix bounds.
+
+Using the above optimization, we now (on average) only need 2 full key
+compares per page (plus ceil(log2(ntupsperpage)) single-attribute compares),
+as opposed to the ceil(log2(ntupsperpage)) + 1 of a naive implementation;
+a significant improvement.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d33f814a93..39e7e9b731 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -328,6 +328,7 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -346,7 +347,8 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -440,7 +442,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = _bt_binsrch_insert(rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -450,6 +452,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -485,7 +489,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(_bt_compare(rel, itup_key, page, offset, &cmpcol) < 0);
break;
}
@@ -510,7 +514,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (_bt_compare(rel, itup_key, page, offset, &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -720,11 +724,12 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -867,6 +872,8 @@ _bt_findinsertloc(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Does the new tuple belong on this page?
*
@@ -884,7 +891,7 @@ _bt_findinsertloc(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, heapRel, insertstate, stack);
@@ -937,6 +944,8 @@ _bt_findinsertloc(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
+
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -967,7 +976,7 @@ _bt_findinsertloc(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -982,10 +991,13 @@ _bt_findinsertloc(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -1004,7 +1016,7 @@ _bt_findinsertloc(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3230b3b894..a6998e48d8 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,7 +26,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -102,6 +103,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
/* heaprel must be set whenever _bt_allocbuf is reachable */
Assert(access == BT_READ || access == BT_WRITE);
@@ -138,7 +141,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* opportunity to finish splits of internal pages too.
*/
*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
- stack_in, page_access, snapshot);
+ stack_in, page_access, snapshot, &highkeycmpcol,
+ (char *) tupdatabuf);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -150,12 +154,15 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = _bt_binsrch(rel, key, *bufP);
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
child = BTreeTupleGetDownLink(itup);
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
/*
* We need to save the location of the pivot tuple we chose in a new
* stack entry for this page/level. If caller ends up splitting a
@@ -189,6 +196,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ highkeycmpcol = 1;
+
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -199,7 +208,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* move right to its new sibling. Do that.
*/
*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
- snapshot);
+ snapshot, &highkeycmpcol, (char *) tupdatabuf);
}
return stack_in;
@@ -248,13 +257,16 @@ _bt_moveright(Relation rel,
bool forupdate,
BTStack stack,
int access,
- Snapshot snapshot)
+ Snapshot snapshot,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
Assert(!forupdate || heaprel != NULL);
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
/*
* When nextkey = false (normal case): if the scan key that brought us to
@@ -277,12 +289,17 @@ _bt_moveright(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -308,14 +325,55 @@ _bt_moveright(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+ /*
+ * tupdatabuf is filled with the right separator of the parent node.
+ * This allows us to do a binary equality check between the parent
+ * node's right separator (which is < key) and this page's P_HIKEY.
+ * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+ * Without this, we'd on average use 3 full key compares per page before
+ * we achieve full dynamic prefix bounds, but with this optimization
+ * that is only 2.
+ *
+ * 3 compares: 1 for the highkey (rightmost), and on average 2 before
+ * we move right in the binary search on the page; this average equals
+ * SUM(1/2 ^ x) for x from 0 to log(n items), which tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ } else {
+ *comparecol = 1;
+ }
+ } else {
+ *comparecol = 1;
+ }
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
{
+ *comparecol = 1;
/* step right one page */
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -341,6 +399,16 @@ _bt_moveright(Relation rel,
* right place to descend to be sure we find all leaf keys >= given scankey
* (or leaf keys > given scankey when nextkey is true).
*
+ * When called, the "highkeycmpcol" pointer argument is expected to contain the
+ * AttrNumber of the first attribute that is not shared between scan key and
+ * this page's high key, i.e. the first attribute that we have to compare
+ * against the scan key. The value will be updated by _bt_binsrch to contain
+ * this same first column we'll need to compare against the scan key, but now
+ * for the index tuple at the returned offset. Valid values range from 1
+ * (no shared prefix) to the number of key attributes + 1 (all index key
+ * attributes are equal to the scan key). See also _bt_compare, and
+ * backend/access/nbtree/README for more info.
+ *
* This procedure is not responsible for walking right, it just examines
* the given page. _bt_binsrch() has no lock or refcount side effects
* on the buffer.
@@ -348,7 +416,8 @@ _bt_moveright(Relation rel,
static OffsetNumber
_bt_binsrch(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -356,6 +425,13 @@ _bt_binsrch(Relation rel,
high;
int32 result,
cmpval;
+ /*
+ * Prefix bounds, for the high/low offset's compare columns.
+ * "highkeycmpcol" is the value for this page's high key (if any) or 1
+ * (no established shared prefix)
+ */
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -388,6 +464,10 @@ _bt_binsrch(Relation rel,
* For nextkey=true (cmpval=0), the loop invariant is: all slots before
* 'low' are <= scan key, all slots at or after 'high' are > scan key.
*
+ * We maintain highcmpcol and lowcmpcol to keep track of prefixes that
+ * tuples share with the scan key, potentially allowing us to skip a
+ * prefix in the midpoint comparison.
+ *
* We can fall out when high == low.
*/
high++; /* establish the loop invariant for high */
@@ -397,17 +477,27 @@ _bt_binsrch(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol); /* update prefix bounds */
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
+ /* update the bounds at the caller */
+ *highkeycmpcol = highcmpcol;
+
/*
* At this point we have high == low, but be careful: they could point
* past the last slot on the page.
@@ -450,7 +540,8 @@ _bt_binsrch(Relation rel,
* list split).
*/
OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -460,6 +551,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -510,16 +602,22 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
+
if (result != 0)
stricthigh = high;
}
@@ -654,6 +752,13 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
* matching TID in the posting tuple, which caller must handle
* themselves (e.g., by splitting the posting list tuple).
*
+ * NOTE: The "comparecol" argument must refer to the first attribute of the
+ * index tuple of which the caller knows that it does not match the scan key:
+ * this means 1 for "no known matching attributes", up to the number of key
+ * attributes + 1 if the caller knows that all key attributes of the index
+ * tuple match those of the scan key. See backend/access/nbtree/README for
+ * details.
+ *
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
* scankey. The actual key value stored is explicitly truncated to 0
@@ -667,7 +772,8 @@ int32
_bt_compare(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -707,8 +813,9 @@ _bt_compare(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
{
Datum datum;
bool isNull;
@@ -752,11 +859,20 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = i;
return result;
+ }
scankey++;
}
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
@@ -887,6 +1003,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ AttrNumber cmpcol = 1;
Assert(!BTScanPosIsValid(so->currPos));
@@ -1415,7 +1532,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = _bt_binsrch(rel, &inskey, buf, &cmpcol);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8891fa7973..11f4184107 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1234,9 +1234,12 @@ extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
Buffer *bufP, int access, Snapshot snapshot);
extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
- int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+ int access, Snapshot snapshot,
+ AttrNumber *comparecol, char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
--
2.40.1
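The core of the dynamic prefix idea in _bt_binsrch can be shown in isolation.
Below is a self-contained C illustration (a toy sorted array of fixed-width
"tuples", not PostgreSQL code): because the items are sorted, any leading key
columns already known to match the search key at both the low and the high
bound must also match at the midpoint, so each midpoint comparison may start
at the first possibly-unequal column.

    #include <stdio.h>

    #define NCOLS 3

    /* Compare starting at column *cmpcol (0-based); on return, *cmpcol is the
     * first column that differed, or NCOLS if all columns were equal. */
    static int
    tuple_compare(const int *tup, const int *key, int *cmpcol)
    {
        for (int col = *cmpcol; col < NCOLS; col++)
        {
            if (key[col] != tup[col])
            {
                *cmpcol = col;
                return (key[col] > tup[col]) ? 1 : -1;
            }
        }
        *cmpcol = NCOLS;
        return 0;
    }

    /* Find the first tuple >= key; tuples[] is sorted on all columns. */
    static int
    binsrch_prefix(int tuples[][NCOLS], int ntuples, const int *key)
    {
        int low = 0, high = ntuples;
        int lowcmpcol = 0,   /* columns known equal at the low bound */
            highcmpcol = 0;  /* columns known equal at the high bound */

        while (high > low)
        {
            int mid = low + (high - low) / 2;
            /* skip the prefix both bounds are known to share with the key */
            int cmpcol = lowcmpcol < highcmpcol ? lowcmpcol : highcmpcol;
            int result = tuple_compare(tuples[mid], key, &cmpcol);

            if (result > 0)
            {
                low = mid + 1;
                lowcmpcol = cmpcol;
            }
            else
            {
                high = mid;
                highcmpcol = cmpcol;
            }
        }
        return low;
    }

    int
    main(void)
    {
        int data[4][NCOLS] = {{1, 1, 1}, {1, 1, 2}, {1, 2, 1}, {2, 1, 1}};
        int key[NCOLS] = {1, 2, 0};

        printf("first offset >= key: %d\n", binsrch_prefix(data, 4, key));
        return 0;
    }

The patch additionally resets these bounds on every page (for the reasons laid
out in the README hunk above) and seeds the high bound from the parent page's
separator when the memcmp against the child's highkey succeeds.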
v12-0006-btree-specialization-for-variable-length-multi-a.patch
From 26e4beb6d6acf6c4e5ec08df44ff7a4898c0f887 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 13 Jan 2023 15:42:41 +0100
Subject: [PATCH v12 6/6] btree specialization for variable-length
multi-attribute keys
The default code path is relatively slow at O(n^2), so with multiple
attributes we accept the increased startup cost in favour of lower
costs for later attributes.
Note that this will only be used for indexes that have at least one
variable-length key attribute (except when that is the last key attribute,
in specific cases).
---
src/backend/access/nbtree/README | 6 +-
src/backend/access/nbtree/nbtree_spec.c | 3 +
src/include/access/itup_attiter.h | 199 ++++++++++++++++++++++++
src/include/access/nbtree.h | 11 +-
src/include/access/nbtree_spec.h | 48 +++++-
5 files changed, 258 insertions(+), 9 deletions(-)
create mode 100644 src/include/access/itup_attiter.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index e90e24cb70..0c45288e61 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1105,14 +1105,12 @@ performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
- indexes with only a single key attribute
+ - multi-column indexes that cannot pre-calculate the offsets of all key
+ attributes in the tuple data section
- multi-column indexes that could benefit from the attcacheoff optimization
NB: This is also the default path, and is comparatively slow for uncachable
attribute offsets.
-Future work will optimize for multi-column indexes that don't benefit
-from the attcacheoff optimization by improving on the O(n^2) nature of
-index_getattr through storing attribute offsets.
-
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 21635397ed..699197dfa7 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_UNCACHED:
+ _bt_specialize_uncached(rel);
+ break;
case NBTS_CTX_SINGLE_KEYATT:
_bt_specialize_single_keyatt(rel);
break;
diff --git a/src/include/access/itup_attiter.h b/src/include/access/itup_attiter.h
new file mode 100644
index 0000000000..c8fb6954bc
--- /dev/null
+++ b/src/include/access/itup_attiter.h
@@ -0,0 +1,199 @@
+/*-------------------------------------------------------------------------
+ *
+ * itup_attiter.h
+ * POSTGRES index tuple attribute iterator definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/itup_attiter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ITUP_ATTITER_H
+#define ITUP_ATTITER_H
+
+#include "access/itup.h"
+#include "varatt.h"
+
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupdesc - the tuple description
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* Offset of attribute 1 is always 0 */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ false, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false);
+
+/*
+ * Initiate an index attribute iterator to attribute attnum,
+ * and return the corresponding datum.
+ *
+ * This is nearly the same as index_deform_tuple, except that this
+ * returns the internal state up to attnum, instead of populating the
+ * datum- and isnull-arrays
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+/* ----------------
+ * index_attiternext() - get the next attribute of an index tuple
+ *
+ * This gets called many times, so we do the least amount of work
+ * possible.
+ *
+ * The code does not attempt to update attcacheoff; as it is unlikely
+ * to reach a situation where the cached offset matters a lot.
+ * If the cached offset do matter, the caller should make sure that
+ * PopulateTupleDescCacheOffsets() was called on the tuple descriptor
+ * to populate the attribute offset cache.
+ *
+ * ----------------
+ */
+static inline Datum
+index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true;
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
+#endif /* ITUP_ATTITER_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 72fbf3a4c6..204e349872 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -16,6 +16,7 @@
#include "access/amapi.h"
#include "access/itup.h"
+#include "access/itup_attiter.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/xlogreader.h"
@@ -1123,18 +1124,26 @@ typedef struct BTOptions
typedef enum NBTS_CTX {
NBTS_CTX_SINGLE_KEYATT,
+ NBTS_CTX_UNCACHED,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
static inline NBTS_CTX _nbt_spec_context(Relation irel)
{
+ AttrNumber nKeyAtts;
+
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
- if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ nKeyAtts = IndexRelationGetNumberOfKeyAttributes(irel);
+
+ if (nKeyAtts == 1)
return NBTS_CTX_SINGLE_KEYATT;
+ if (TupleDescAttr(irel->rd_att, nKeyAtts - 1)->attcacheoff < -1)
+ return NBTS_CTX_UNCACHED;
+
return NBTS_CTX_CACHED;
}
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index 8e476c300d..efed9824e7 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -45,6 +45,7 @@
* Macros used in the nbtree specialization code.
*/
#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -53,8 +54,10 @@
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
(NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_UNCACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_UNCACHED)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
) \
)
@@ -69,8 +72,11 @@ do { \
Assert(PointerIsValid(rel)); \
if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
{ \
- nbts_prep_ctx(rel); \
- _bt_specialize(rel); \
+ PopulateTupleDescCacheOffsets((rel)->rd_att); \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
} \
} while (false)
@@ -216,6 +222,40 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but the attcacheoff optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit((itup), (initAttNum), (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* All next uses of nbts_prep_ctx are in non-templated code, so here we make
* sure we actually create the context.
--
2.40.1
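To see why the iterator in itup_attiter.h helps for variable-length
attributes, note that the offset of attribute n is only known after walking
attributes 1..n-1, so a getattr-style lookup that restarts from the first
attribute on every call costs O(n^2) per tuple, while an iterator that
carries the running offset costs O(n). The following self-contained sketch
uses a toy length-prefixed tuple format (not the IndexTuple layout) purely
to illustrate the iterator shape:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* toy tuple format: each attribute is a 1-byte length followed by data */
    typedef struct AttIter
    {
        const uint8_t *tup;
        size_t offset;  /* running offset of the next attribute */
    } AttIter;

    static void
    attiter_init(AttIter *it, const uint8_t *tup)
    {
        it->tup = tup;
        it->offset = 0;
    }

    /* return a pointer to the next attribute's data, and its length */
    static const uint8_t *
    attiter_next(AttIter *it, size_t *len)
    {
        const uint8_t *data = it->tup + it->offset + 1;

        *len = it->tup[it->offset];
        it->offset += 1 + *len;   /* carry the offset to the next attribute */
        return data;
    }

    int
    main(void)
    {
        /* three variable-length attributes: "ab", "c", "defg" */
        const uint8_t tup[] = {2, 'a', 'b', 1, 'c', 4, 'd', 'e', 'f', 'g'};
        AttIter it;
        size_t len;

        attiter_init(&it, tup);
        for (int attno = 1; attno <= 3; attno++)
        {
            const uint8_t *data = attiter_next(&it, &len);

            printf("att %d: %.*s\n", attno, (int) len, (const char *) data);
        }
        return 0;
    }

The real index_attiternext() additionally has to deal with NULL bitmaps,
alignment, and attcacheoff, but the O(n) "carry the offset forward" structure
is the same.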
On Wed, 30 Aug 2023 at 21:50, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Updated in the attached version 12 of the patchset (which is also
> rebased on HEAD @ 9c13b681). No changes apart from rebase fixes and
> these added comments.
Rebased again to v13 to account for API changes in 9f060253 "Remove
some more "snapshot too old" vestiges."
Kind regards,
Matthias van de Meent
On Mon, 18 Sept 2023 at 17:56, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Wed, 30 Aug 2023 at 21:50, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>> Updated in the attached version 12 of the patchset (which is also
>> rebased on HEAD @ 9c13b681). No changes apart from rebase fixes and
>> these added comments.
> Rebased again to v13 to account for API changes in 9f060253 "Remove
> some more "snapshot too old" vestiges."
... and now attached.
Kind regards,
Matthias van de Meent
Attachments:
v13-0005-Add-an-attcacheoff-populating-function.patch
From 34c25502bfb68a42ce16e9a335d1d44c29ab9d04 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 12 Jan 2023 21:34:36 +0100
Subject: [PATCH v13 5/6] Add an attcacheoff-populating function
It populates attcacheoff-capable attributes with the correct offset,
and fills attributes whose offset is uncacheable with an 'uncacheable'
indicator value, as opposed to -1, which signals "unknown".
This allows users of the API to remove redundant cycles that try to
cache the offsets of attributes: instead of O(N-attrs) operations, this
only requires an O(1) check.
---
src/backend/access/common/tupdesc.c | 111 ++++++++++++++++++++++++++++
src/include/access/tupdesc.h | 2 +
2 files changed, 113 insertions(+)
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 7c5c390503..b3f543cd83 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -927,3 +927,114 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
return desc;
}
+
+/*
+ * PopulateTupleDescCacheOffsets
+ *
+ * Populate the attcacheoff fields of a TupleDesc, returning the last
+ * attcacheoff with a valid offset value.
+ *
+ * Populates attcacheoff with a negative cache value when no offset
+ * can be calculated (due to e.g. variable length attributes).
+ * The negative value is a value relative to the last cacheable attribute
+ * attcacheoff = -1 - (thisattno - cachedattno)
+ * so that the last attribute with cached offset can be found with
+ * cachedattno = attcacheoff + 1 + thisattno
+ *
+ * The value returned is the AttrNumber of the last (1-based) attribute that
+ * had its offset cached.
+ *
+ * When the TupleDesc has 0 attributes, it returns 0.
+ */
+AttrNumber
+PopulateTupleDescCacheOffsets(TupleDesc desc)
+{
+ int numberOfAttributes = desc->natts;
+ AttrNumber currAttNo, lastCachedAttNo;
+
+ if (numberOfAttributes == 0)
+ return 0;
+
+ /* Non-negative value: this attribute is cached */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff >= 0)
+ return (AttrNumber) desc->natts;
+ /*
+ * Attribute has been filled with relative offset to last cached value, but
+ * it itself is unreachable.
+ */
+ if (TupleDescAttr(desc, desc->natts - 1)->attcacheoff != -1)
+ return (AttrNumber) (TupleDescAttr(desc, desc->natts - 1)->attcacheoff + 1 + desc->natts);
+
+ /* last attribute of the tupledesc may or may not support attcacheoff */
+
+ /*
+ * First attribute always starts at offset zero.
+ */
+ TupleDescAttr(desc, 0)->attcacheoff = 0;
+
+ currAttNo = 1;
+ /*
+ * Other code may have populated the value previously.
+ * Skip all positive offsets to get to the first attribute without
+ * attcacheoff.
+ */
+ while (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff >= 0)
+ currAttNo++;
+
+ /*
+ * Cache offset is undetermined. Start calculating offsets if possible.
+ *
+ * When we exit this block, currAttNo will point at the first uncacheable
+ * attribute, or past the end of the attribute array.
+ */
+ if (currAttNo < numberOfAttributes &&
+ TupleDescAttr(desc, currAttNo)->attcacheoff == -1)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, currAttNo - 1);
+ int32 off = att->attcacheoff;
+
+ if (att->attlen >= 0) {
+ off += att->attlen;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ att = TupleDescAttr(desc, currAttNo);
+
+ if (att->attlen < 0)
+ {
+ if (off == att_align_nominal(off, att->attalign))
+ {
+ att->attcacheoff = off;
+ currAttNo++;
+ }
+ break;
+ }
+
+ off = att_align_nominal(off, att->attalign);
+ att->attcacheoff = off;
+ off += att->attlen;
+ currAttNo++;
+ }
+ }
+ }
+
+ Assert(currAttNo == numberOfAttributes || (
+ currAttNo < numberOfAttributes
+ && TupleDescAttr(desc, (currAttNo - 1))->attcacheoff >= 0
+ && TupleDescAttr(desc, currAttNo)->attcacheoff == -1
+ ));
+ /*
+ * No cacheable offsets left. Fill the rest with negative cache values,
+ * but return the latest cached offset.
+ */
+ lastCachedAttNo = currAttNo;
+
+ while (currAttNo < numberOfAttributes)
+ {
+ TupleDescAttr(desc, currAttNo)->attcacheoff = -1 - (currAttNo - lastCachedAttNo);
+ currAttNo++;
+ }
+
+ return lastCachedAttNo;
+}
\ No newline at end of file
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index b4286cf922..2673f2d0f3 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -151,4 +151,6 @@ extern TupleDesc BuildDescForRelation(List *schema);
extern TupleDesc BuildDescFromLists(List *names, List *types, List *typmods, List *collations);
+extern AttrNumber PopulateTupleDescCacheOffsets(TupleDesc desc);
+
#endif /* TUPDESC_H */
--
2.40.1
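The negative attcacheoff encoding used by PopulateTupleDescCacheOffsets can
be illustrated with a small stand-alone sketch of the formulas given in its
header comment (illustration only; not code from the patch):

    #include <stdio.h>

    /* Encode: for an uncacheable attribute thisattno (1-based), given the
     * last attribute number whose offset is cached. Result is always <= -2. */
    static int
    encode_uncacheable(int thisattno, int cachedattno)
    {
        return -1 - (thisattno - cachedattno);
    }

    /* Decode: recover the last cached attribute number from the stored value. */
    static int
    last_cached_attno(int attcacheoff, int thisattno)
    {
        return attcacheoff + 1 + thisattno;
    }

    int
    main(void)
    {
        int cachedattno = 2;   /* attributes 1 and 2 have cacheable offsets */

        for (int attno = 3; attno <= 5; attno++)
        {
            int stored = encode_uncacheable(attno, cachedattno);

            printf("att %d: attcacheoff = %d, last cached = %d\n",
                   attno, stored, last_cached_attno(stored, attno));
        }
        return 0;
    }

So the first uncacheable attribute stores -2, the next -3, and so on, which
is what lets callers distinguish "offset not cacheable" from the plain -1
"not computed yet" without rescanning the descriptor.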
v13-0001-Implement-dynamic-prefix-compression-in-nbtree.patch
From 1b096923670ebb4f17743ba007c0b2e168104fd7 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 10 Jan 2023 21:45:44 +0100
Subject: [PATCH v13 1/6] Implement dynamic prefix compression in nbtree
Because tuples are ordered on the page, if some prefix of the
scan attributes on both sides of the compared tuple are equal
to the scankey, then the current tuple that is being compared
must also have those prefixing attributes that equal the
scankey.
We cannot generally propagate this information to _binsrch on
lower pages, as this downstream page may have concurrently split
and/or have merged with its deleted left neighbour (see [0]),
which moves the keyspace of the linked page. We thus can only
trust the current state of this current page for this optimization,
which means we must validate this state each time we open the page.
Although this limits the overall applicability of the
performance improvement, it still allows for a nice performance
improvement in most cases where initial columns have many
duplicate values and a compare function that is not cheap.
As an exception to the above rule, most of the time a page's
highkey is equal to the right separator on the parent page due to
how btree splits are done. By storing this right separator from
the parent page and then validating that the highkey of the child
page contains the exact same data, we can restore the right prefix
bound without having to call the relatively expensive _bt_compare.
In the worst-case scenario of a concurrent page split, we'd still
have to validate the full key, but that doesn't happen very often
when compared to the number of times we descend the btree.
---
contrib/amcheck/verify_nbtree.c | 17 +--
src/backend/access/nbtree/README | 43 ++++++++
src/backend/access/nbtree/nbtinsert.c | 34 ++++--
src/backend/access/nbtree/nbtsearch.c | 146 +++++++++++++++++++++++---
src/include/access/nbtree.h | 9 +-
5 files changed, 215 insertions(+), 34 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index dbb83d80f8..e5cab3e3a9 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2701,6 +2701,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTInsertStateData insertstate;
OffsetNumber offnum;
Page page;
+ AttrNumber cmpcol = 1;
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -2710,13 +2711,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.buf = lbuf;
/* Get matching tuple on leaf page */
- offnum = _bt_binsrch_insert(state->rel, &insertstate);
+ offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
/* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
insertstate.postingoff <= 0 &&
- _bt_compare(state->rel, key, page, offnum) == 0)
+ _bt_compare(state->rel, key, page, offnum, &cmpcol) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
}
@@ -2778,6 +2779,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
@@ -2788,7 +2790,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
if (!key->heapkeyspace)
return invariant_leq_offset(state, key, upperbound);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
/*
* _bt_compare() is capable of determining that a scankey with a
@@ -2840,10 +2842,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber upperbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, upperbound);
+ cmp = _bt_compare(state->rel, key, state->target, upperbound, &cmpcol);
return cmp <= 0;
}
@@ -2863,10 +2866,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
OffsetNumber lowerbound)
{
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
- cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+ cmp = _bt_compare(state->rel, key, state->target, lowerbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
@@ -2901,13 +2905,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
{
ItemId itemid;
int32 cmp;
+ AttrNumber cmpcol = 1;
Assert(key->pivotsearch);
/* Verify line pointer before checking tuple */
itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
upperbound);
- cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+ cmp = _bt_compare(state->rel, key, nontarget, upperbound, &cmpcol);
/* pg_upgrade'd indexes may legally have equal sibling tuples */
if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..0f10141a2f 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,49 @@ large groups of duplicates, maximizing space utilization. Note also that
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.
+Notes about dynamic prefix truncation
+-------------------------------------
+
+Because the nbtree keyspace is sorted, once we have determined that some
+prefix of key columns in the tuples on both sides of the tuple being
+compared is equal to the scankey, the current tuple must share that same
+prefix with the scankey. This allows us to skip comparing those columns,
+saving the indirect function calls in the compare operation.
+
+We can only use this constraint if we have proven this information while we
+hold a pin on the page, so it is only useful at the page level: concurrent
+page deletions and splits may have moved the keyspace of the page referenced
+by a parent page to the right. If we reused high- and low-column prefixes,
+we would not be able to detect a change of keyspace from e.g. [2,3) to [1,2),
+and would subsequently return invalid results. This race condition can only
+be prevented by re-establishing the equal-prefix bounds for each page.
+
+There is good news, though: a page split puts a binary copy of the split
+page's high key into the parent page. This means we can usually reuse the
+compare result we obtained for the downlink's right separator in the parent
+page once we discover that it is binary-identical to the child page's high
+key. This is generally the case, as only concurrent page splits and
+deletions can leave the downlink pointing to a page without the expected
+high key bound (_bt_moveright only rarely actually moves right).
+
+To implement this, we copy the downlink's right separator key into a
+temporary buffer, which is then compared against the child page's high key.
+If they match, we reuse the compare result (plus prefix) we had for it in
+the parent page; if not, we need to do a full _bt_compare. Because memcpy +
+memcmp is cheap compared to _bt_compare, and because a wrong guess is quite
+unlikely, this speeds up _bt_moveright (at the cost of some stack memory in
+_bt_search and some overhead in the case of a wrong prediction).
+
+Now that we have prefix bounds for the page's high key, the _bt_binsrch
+procedure uses this result as the rightmost prefix bound, and each step of
+the binary search tightens the equal-prefix bound on the side toward which
+it moves.
+
+Using the above optimization, we now (on average) need only 2 full key
+compares per page (plus ceil(log2(ntupsperpage)) single-attribute compares),
+as opposed to the ceil(log2(ntupsperpage)) + 1 full key compares of a naive
+implementation; a significant improvement.
+
Notes about deduplication
-------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 9cff4f2931..8f602ab2d6 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -328,6 +328,7 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
{
Page page;
BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
@@ -346,7 +347,8 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
!P_IGNORE(opaque) &&
PageGetFreeSpace(page) > insertstate->itemsz &&
PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
{
/*
* Caller can use the fastpath optimization because cached
@@ -440,7 +442,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* in the fastpath below, but also in the _bt_findinsertloc() call later.
*/
Assert(!insertstate->bounds_valid);
- offset = _bt_binsrch_insert(rel, insertstate);
+ offset = _bt_binsrch_insert(rel, insertstate, 1);
/*
* Scan over all equal tuples, looking for live conflicts.
@@ -450,6 +452,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(itup_key->scantid == NULL);
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Each iteration of the loop processes one heap TID, not one index
* tuple. Current offset number for page isn't usually advanced on
@@ -485,7 +489,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
Assert(insertstate->bounds_valid);
Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
Assert(insertstate->low <= insertstate->stricthigh);
- Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+ Assert(_bt_compare(rel, itup_key, page, offset, &cmpcol) < 0);
break;
}
@@ -510,7 +514,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!inposting)
{
/* Plain tuple, or first TID in posting list tuple */
- if (_bt_compare(rel, itup_key, page, offset) != 0)
+ if (_bt_compare(rel, itup_key, page, offset, &cmpcol) != 0)
break; /* we're past all the equal tuples */
/* Advanced curitup */
@@ -720,11 +724,12 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
{
int highkeycmp;
+ cmpcol = 1;
/* If scankey == hikey we gotta check the next page too */
if (P_RIGHTMOST(opaque))
break;
- highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+ highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol);
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;
@@ -867,6 +872,8 @@ _bt_findinsertloc(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
/*
* Does the new tuple belong on this page?
*
@@ -884,7 +891,7 @@ _bt_findinsertloc(Relation rel,
/* Test '<=', not '!=', since scantid is set now */
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
break;
_bt_stepright(rel, heapRel, insertstate, stack);
@@ -937,6 +944,8 @@ _bt_findinsertloc(Relation rel,
*/
while (PageGetFreeSpace(page) < insertstate->itemsz)
{
+ AttrNumber cmpcol = 1;
+
/*
* Before considering moving right, see if we can obtain enough
* space by erasing LP_DEAD items
@@ -967,7 +976,7 @@ _bt_findinsertloc(Relation rel,
break;
if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
break;
@@ -982,10 +991,13 @@ _bt_findinsertloc(Relation rel,
* We should now be on the correct page. Find the offset within the page
* for the new tuple. (Possibly reusing earlier search bounds.)
*/
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
if (insertstate->postingoff == -1)
{
@@ -1004,7 +1016,7 @@ _bt_findinsertloc(Relation rel,
*/
Assert(!insertstate->bounds_valid);
insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate);
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
Assert(insertstate->postingoff == 0);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 17ad89749d..44a09ced98 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,7 +26,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +99,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
/* heaprel must be set whenever _bt_allocbuf is reachable */
Assert(access == BT_READ || access == BT_WRITE);
@@ -134,7 +137,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* opportunity to finish splits of internal pages too.
*/
*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
- stack_in, page_access);
+ stack_in, page_access, &highkeycmpcol,
+ (char *) tupdatabuf);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -146,12 +150,15 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* Find the appropriate pivot tuple on this page. Its downlink points
* to the child page that we're about to descend to.
*/
- offnum = _bt_binsrch(rel, key, *bufP);
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
child = BTreeTupleGetDownLink(itup);
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
/*
* We need to save the location of the pivot tuple we chose in a new
* stack entry for this page/level. If caller ends up splitting a
@@ -185,6 +192,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
*/
if (access == BT_WRITE && page_access == BT_READ)
{
+ highkeycmpcol = 1;
+
/* trade in our read lock for a write lock */
_bt_unlockbuf(rel, *bufP);
_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -194,7 +203,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* but before we acquired a write lock. If it has, we may need to
* move right to its new sibling. Do that.
*/
- *bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE);
+ *bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+ &highkeycmpcol, (char *) tupdatabuf);
}
return stack_in;
@@ -238,13 +248,16 @@ _bt_moveright(Relation rel,
Buffer buf,
bool forupdate,
BTStack stack,
- int access)
+ int access,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
Assert(!forupdate || heaprel != NULL);
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
/*
* When nextkey = false (normal case): if the scan key that brought us to
@@ -267,11 +280,16 @@ _bt_moveright(Relation rel,
for (;;)
{
+ AttrNumber cmpcol = 1;
+
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
break;
+ }
/*
* Finish any incomplete splits we encounter along the way.
@@ -297,14 +315,55 @@ _bt_moveright(Relation rel,
continue;
}
- if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+ /*
+ * tupdatabuf is filled with the right separator of the parent node.
+ * This allows us to do a binary equality check between the parent
+ * node's right separator (which is < key) and this page's P_HIKEY.
+ * If they are equal, we can reuse the result of the parent node's
+ * rightkey compare, which means we can potentially save a full key
+ * compare (which includes indirect calls to attribute comparison
+ * functions).
+ *
+ * Without this, we'd use on average 3 full key compares per page
+ * before we establish full dynamic prefix bounds; with this
+ * optimization that drops to 2.
+ *
+ * 3 compares: 1 for the high key (rightmost), and on average 2 before
+ * we move right in the binary search on the page; this average equals
+ * SUM(1/2^x) for x from 0 to log2(nitems), which tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ }
+ else
+ *comparecol = 1;
+ }
+ else
+ *comparecol = 1;
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
{
+ *comparecol = 1;
/* step right one page */
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
+ {
+ *comparecol = cmpcol;
break;
+ }
}
if (P_IGNORE(opaque))
@@ -330,6 +389,16 @@ _bt_moveright(Relation rel,
* right place to descend to be sure we find all leaf keys >= given scankey
* (or leaf keys > given scankey when nextkey is true).
*
+ * When called, the "highkeycmpcol" pointer argument is expected to contain the
+ * AttrNumber of the first attribute that is not shared between scan key and
+ * this page's high key, i.e. the first attribute that we have to compare
+ * against the scan key. The value will be updated by _bt_binsrch to contain
+ * this same first column we'll need to compare against the scan key, but now
+ * for the index tuple at the returned offset. Valid values range from 1
+ * (no shared prefix) to the number of key attributes + 1 (all index key
+ * attributes are equal to the scan key). See also _bt_compare, and
+ * backend/access/nbtree/README for more info.
+ *
* This procedure is not responsible for walking right, it just examines
* the given page. _bt_binsrch() has no lock or refcount side effects
* on the buffer.
@@ -337,7 +406,8 @@ _bt_moveright(Relation rel,
static OffsetNumber
_bt_binsrch(Relation rel,
BTScanInsert key,
- Buffer buf)
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
{
Page page;
BTPageOpaque opaque;
@@ -345,6 +415,13 @@ _bt_binsrch(Relation rel,
high;
int32 result,
cmpval;
+ /*
+ * Prefix bounds, for the high/low offset's compare columns.
+ * "highkeycmpcol" is the value for this page's high key (if any) or 1
+ * (no established shared prefix)
+ */
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -377,6 +454,10 @@ _bt_binsrch(Relation rel,
* For nextkey=true (cmpval=0), the loop invariant is: all slots before
* 'low' are <= scan key, all slots at or after 'high' are > scan key.
*
+ * We maintain highcmpcol and lowcmpcol to keep track of prefixes that
+ * tuples share with the scan key, potentially allowing us to skip a
+ * prefix in the midpoint comparison.
+ *
* We can fall out when high == low.
*/
high++; /* establish the loop invariant for high */
@@ -386,17 +467,27 @@ _bt_binsrch(Relation rel,
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol); /* update prefix bounds */
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
+ {
high = mid;
+ highcmpcol = cmpcol;
+ }
}
+ /* update the bounds at the caller */
+ *highkeycmpcol = highcmpcol;
+
/*
* At this point we have high == low, but be careful: they could point
* past the last slot on the page.
@@ -439,7 +530,8 @@ _bt_binsrch(Relation rel,
* list split).
*/
OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
{
BTScanInsert key = insertstate->itup_key;
Page page;
@@ -449,6 +541,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
stricthigh;
int32 result,
cmpval;
+ AttrNumber lowcmpcol = 1;
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -499,16 +592,22 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
while (high > low)
{
OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
if (result >= cmpval)
+ {
low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
else
{
high = mid;
+ highcmpcol = cmpcol;
+
if (result != 0)
stricthigh = high;
}
@@ -643,6 +742,13 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
* matching TID in the posting tuple, which caller must handle
* themselves (e.g., by splitting the posting list tuple).
*
+ * NOTE: The "comparecol" argument must refer to the first attribute of the
+ * index tuple that the caller does not already know to match the scan key:
+ * this means 1 for "no known matching attributes", up to the number of key
+ * attributes + 1 if the caller knows that all key attributes of the index
+ * tuple match those of the scan key. See backend/access/nbtree/README for
+ * details.
+ *
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
* scankey. The actual key value stored is explicitly truncated to 0
@@ -656,7 +762,8 @@ int32
_bt_compare(Relation rel,
BTScanInsert key,
Page page,
- OffsetNumber offnum)
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
{
TupleDesc itupdesc = RelationGetDescr(rel);
BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -696,8 +803,9 @@ _bt_compare(Relation rel,
ncmpkey = Min(ntupatts, key->keysz);
Assert(key->heapkeyspace || ncmpkey == key->keysz);
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
- scankey = key->scankeys;
- for (int i = 1; i <= ncmpkey; i++)
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
{
Datum datum;
bool isNull;
@@ -741,11 +849,20 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
+ {
+ *comparecol = i;
return result;
+ }
scankey++;
}
+ /*
+ * All of the tuple's compared attributes are equal to the scan key; only
+ * attributes beyond ntupatts could potentially differ from the scan key.
+ */
+ *comparecol = ntupatts + 1;
+
/*
* All non-truncated attributes (other than heap TID) were found to be
* equal. Treat truncated attributes as minus infinity when scankey has a
@@ -876,6 +993,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ AttrNumber cmpcol = 1;
Assert(!BTScanPosIsValid(so->currPos));
@@ -1403,7 +1521,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
+ offnum = _bt_binsrch(rel, &inskey, buf, &cmpcol);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f5c66964ca..0579120693 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1234,9 +1234,12 @@ extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
Buffer *bufP, int access);
extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
- int access);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+ int access, AttrNumber *comparecol,
+ char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
--
2.40.1
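
For reviewers who want the gist of the prefix-bounded binary search without
reading the full diff: below is a minimal, self-contained sketch of the same
idea. None of it is patch code; binsrch_with_prefix, cmp_from_fn,
compare_from and arg are made-up stand-ins, with compare_from playing the
role of _bt_compare and its new in/out comparecol argument.

#include <stddef.h>

/*
 * compare_from() compares the search key against element "mid", starting
 * at 1-based column *cmpcol, and reports the first unequal column (or
 * natts + 1 on full equality) back through *cmpcol.  It returns <0, 0 or
 * >0, like any comparator.
 */
typedef int (*cmp_from_fn) (const void *key, size_t mid,
							int *cmpcol, void *arg);

static size_t
binsrch_with_prefix(const void *key, size_t nitems,
					cmp_from_fn compare_from, void *arg)
{
	size_t		low = 0,
				high = nitems;
	int			lowcol = 1;		/* columns known equal below 'low' */
	int			highcol = 1;	/* columns known equal at/after 'high' */

	while (high > low)
	{
		size_t		mid = low + (high - low) / 2;
		/* only the smaller of the two bounds is guaranteed for 'mid' */
		int			cmpcol = (lowcol < highcol) ? lowcol : highcol;
		int			result = compare_from(key, mid, &cmpcol, arg);

		if (result >= 0)
		{
			low = mid + 1;
			lowcol = cmpcol;	/* slots below 'low' share this prefix */
		}
		else
		{
			high = mid;
			highcol = cmpcol;	/* slots at/after 'high' share this prefix */
		}
	}
	return low;
}

The essential invariant is taking the minimum of the two bounds at the top
of each iteration (Min(highcmpcol, lowcmpcol) in the patch): a column may
only be skipped once it has been proven equal on both the left and the right
bound of the current search window.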
v13-0004-Optimize-nbts_attiter-for-nkeyatts-1-btrees.patchapplication/octet-stream; name=v13-0004-Optimize-nbts_attiter-for-nkeyatts-1-btrees.patchDownload
From ddacd699417ac50e34e569138a62f27a11ea163e Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 20:04:56 +0100
Subject: [PATCH v13 4/6] Optimize nbts_attiter for nkeyatts==1 btrees
This removes the index_getattr_nocache call path, which has significant overhead, and instead uses a constant offset of 0.
---
src/backend/access/nbtree/README | 1 +
src/backend/access/nbtree/nbtree_spec.c | 3 ++
src/include/access/nbtree.h | 35 ++++++++++++++++
src/include/access/nbtree_spec.h | 56 ++++++++++++++++++++++++-
4 files changed, 93 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index e9d0cf6ac1..e90e24cb70 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1104,6 +1104,7 @@ in the index AM to call the specialized functions, increasing the
performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
+ - indexes with only a single key attribute
- multi-column indexes that could benefit from the attcacheoff optimization
+ NB: This is also the default path, and is comparatively slow for uncacheable
attribute offsets.
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 6b766581ab..21635397ed 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_SINGLE_KEYATT:
+ _bt_specialize_single_keyatt(rel);
+ break;
case NBTS_CTX_DEFAULT:
break;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ce6999d4b5..89d1c7ab01 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1122,6 +1122,7 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
typedef enum NBTS_CTX {
+ NBTS_CTX_SINGLE_KEYATT,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
@@ -1131,9 +1132,43 @@ static inline NBTS_CTX _nbt_spec_context(Relation irel)
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
+ if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ return NBTS_CTX_SINGLE_KEYATT;
+
return NBTS_CTX_CACHED;
}
+static inline Datum _bt_getfirstatt(IndexTuple tuple, TupleDesc tupleDesc,
+ bool *isNull)
+{
+ Datum result;
+ if (IndexTupleHasNulls(tuple))
+ {
+ if (att_isnull(0, (bits8 *)(tuple) + sizeof(IndexTupleData)))
+ {
+ *isNull = true;
+ result = (Datum) 0;
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)
+ + sizeof(IndexAttributeBitMapData)));
+ }
+ }
+ else
+ {
+ *isNull = false;
+ result = fetchatt(TupleDescAttr(tupleDesc, 0),
+ ((char *) tuple)
+ + MAXALIGN(sizeof(IndexTupleData)));
+ }
+
+ return result;
+}
+
#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
#include "nbtree_spec.h"
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index fa38b09c6e..8e476c300d 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -44,6 +44,7 @@
/*
* Macros used in the nbtree specialization code.
*/
+#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -51,8 +52,10 @@
/* contextual specializations */
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
)
@@ -164,6 +167,55 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Specialization 3: SINGLE_KEYATT
+ *
+ * Optimized access for indexes with a single key column.
+ *
+ * Note that this path cannot be used for indexes with multiple key
+ * columns, because it never considers the next column.
+ */
+
+/* the default context (and later contexts) do need to specialize, so here's that */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel)
+
+#define NBTS_SPECIALIZING_SINGLE_KEYATT
+#define NBTS_TYPE NBTS_TYPE_SINGLE_KEYATT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ Assert((endAttNum) == 1); ((void) (endAttNum)); \
+ if ((initAttNum) == 1) for (int spec_i = 0; spec_i < 1; spec_i++)
+
+#define nbts_attiter_attnum 1
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+( \
+ AssertMacro(spec_i == 0), \
+ _bt_getfirstatt(itup, tupDesc, &NBTS_MAKE_NAME(itup, isNull)) \
+)
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_SINGLE_KEYATT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* All next uses of nbts_prep_ctx are in non-templated code, so here we make
* sure we actually create the context.
--
2.40.1
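
As a side note for readers unfamiliar with the specialization machinery: the
nested conditional in NBTS_SPECIALIZE_NAME above is just a per-call-site
dispatch on the index's key shape. Here is a toy model of the same pattern,
entirely made up for illustration (toy_compare*, ToyCtx and friends are not
patch code; the real patch generates its variants by #including one source
file per shape instead of writing them out by hand):

#include <stdio.h>

typedef enum ToyCtx
{
	TOY_CTX_SINGLE_KEYATT,		/* exactly one key column */
	TOY_CTX_CACHED				/* generic / fallback path */
} ToyCtx;

static int
toy_compare_single_keyatt(const int *tuple, int key)
{
	/* may assume a single key column: no per-attribute iteration at all */
	return (tuple[0] > key) - (tuple[0] < key);
}

static int
toy_compare_cached(const int *tuple, int key)
{
	/* placeholder for the generic, attcacheoff-based code path */
	return (tuple[0] > key) - (tuple[0] < key);
}

/* conceptually, what the NBTS_SPECIALIZE_NAME() conditional expands to */
#define toy_compare(ctx, tuple, key) \
	((ctx) == TOY_CTX_SINGLE_KEYATT ? \
	 toy_compare_single_keyatt((tuple), (key)) : \
	 toy_compare_cached((tuple), (key)))

int
main(void)
{
	int			tuple[1] = {42};
	ToyCtx		ctx = TOY_CTX_SINGLE_KEYATT;

	printf("compare: %d\n", toy_compare(ctx, tuple, 41));
	return 0;
}

Because the context is computed once per operation (NBTS_MAKE_CTX in the
patch) and each variant is compiled separately, the per-shape code can be
optimized and inlined without carrying the other shapes' branches, which is
where the expected speedup over a shape-agnostic implementation comes from.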
v13-0003-Use-specialized-attribute-iterators-in-the-speci.patchapplication/octet-stream; name=v13-0003-Use-specialized-attribute-iterators-in-the-speci.patchDownload
From 4c7d6b471c4025bedd7c386914c0913f409a57a5 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:57:21 +0100
Subject: [PATCH v13 3/6] Use specialized attribute iterators in the
specialized source files
This is committed separately to make clear what substantial changes were
made to the pre-existing code.
Even though not all nbt*_spec functions have been updated, these functions
can now directly call (and inline, and optimize for) the specialized functions
they invoke, instead of having to determine the right specialization based on
the (potentially locally unavailable) index relation. This makes specializing
(i.e. duplicating) those functions still worthwhile.
---
src/backend/access/nbtree/nbtsearch_spec.c | 18 +++---
src/backend/access/nbtree/nbtsort_spec.c | 24 +++----
src/backend/access/nbtree/nbtutils_spec.c | 62 ++++++++++++-------
.../utils/sort/tuplesortvariants_spec.c | 53 +++++++++-------
4 files changed, 92 insertions(+), 65 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
index 383010dc31..a93841e686 100644
--- a/src/backend/access/nbtree/nbtsearch_spec.c
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -658,6 +658,7 @@ _bt_compare(Relation rel,
int ncmpkey;
int ntupatts;
int32 result;
+ nbts_attiterdeclare(itup);
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -690,23 +691,26 @@ _bt_compare(Relation rel,
Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
+ nbts_attiterinit(itup, *comparecol, itupdesc);
+
+ nbts_foreachattr(*comparecol, ncmpkey)
{
Datum datum;
- bool isNull;
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+ datum = nbts_attiter_nextattdatum(itup, itupdesc);
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ /* key is NULL */
+ if (scankey->sk_flags & SK_ISNULL)
{
- if (isNull)
+ if (nbts_attiter_curattisnull(itup))
result = 0; /* NULL "=" NULL */
else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (isNull) /* key is NOT_NULL and item is NULL */
+ /* key is NOT_NULL and item is NULL */
+ else if (nbts_attiter_curattisnull(itup))
{
if (scankey->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -735,7 +739,7 @@ _bt_compare(Relation rel,
/* if the keys are unequal, return the difference */
if (result != 0)
{
- *comparecol = i;
+ *comparecol = nbts_attiter_attnum;
return result;
}
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
index 368d6f244c..6f33cc4cc2 100644
--- a/src/backend/access/nbtree/nbtsort_spec.c
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -34,8 +34,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
itup2 = NULL;
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
@@ -57,7 +56,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
/* Prepare SortSupport data for each column */
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
- for (i = 0; i < keysz; i++)
+ for (int i = 0; i < keysz; i++)
{
SortSupport sortKey = sortKeys + i;
ScanKey scanKey = wstate->inskey->scankeys + i;
@@ -90,21 +89,24 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else if (itup != NULL)
{
int32 compare = 0;
+ nbts_attiterdeclare(itup);
+ nbts_attiterdeclare(itup2);
- for (i = 1; i <= keysz; i++)
+ nbts_attiterinit(itup, 1, tupdes);
+ nbts_attiterinit(itup2, 1, tupdes);
+
+ nbts_foreachattr(1, keysz)
{
SortSupport entry;
Datum attrDatum1,
attrDatum2;
- bool isNull1,
- isNull2;
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+ entry = sortKeys + nbts_attiter_attnum - 1;
+ attrDatum1 = nbts_attiter_nextattdatum(itup, tupdes);
+ attrDatum2 = nbts_attiter_nextattdatum(itup2, tupdes);
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
+ compare = ApplySortComparator(attrDatum1, nbts_attiter_curattisnull(itup),
+ attrDatum2, nbts_attiter_curattisnull(itup2),
entry);
if (compare > 0)
{
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
index 0288da22d6..07ca18f404 100644
--- a/src/backend/access/nbtree/nbtutils_spec.c
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -64,7 +64,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
int indnkeyatts;
int16 *indoption;
int tupnatts;
- int i;
+ nbts_attiterdeclare(itup);
itupdesc = RelationGetDescr(rel);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -95,7 +95,10 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->scantid = key->heapkeyspace && itup ?
BTreeTupleGetHeapTID(itup) : NULL;
skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
+
+ nbts_attiterinit(itup, 1, itupdesc);
+
+ nbts_foreachattr(1, indnkeyatts)
{
FmgrInfo *procinfo;
Datum arg;
@@ -106,27 +109,30 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
* We can use the cached (default) support procs since no cross-type
* comparison can be needed.
*/
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+ procinfo = index_getprocinfo(rel, nbts_attiter_attnum, BTORDER_PROC);
/*
* Key arguments built from truncated attributes (or when caller
* provides no tuple) are defensively represented as NULL values. They
* should never be used.
*/
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
+ if (nbts_attiter_attnum <= tupnatts)
+ {
+ arg = nbts_attiter_nextattdatum(itup, itupdesc);
+ null = nbts_attiter_curattisnull(itup);
+ }
else
{
arg = (Datum) 0;
null = true;
}
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags = (null ? SK_ISNULL : 0) | (indoption[nbts_attiter_attnum - 1] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[nbts_attiter_attnum - 1],
flags,
- (AttrNumber) (i + 1),
+ (AttrNumber) nbts_attiter_attnum,
InvalidStrategy,
InvalidOid,
- rel->rd_indcollation[i],
+ rel->rd_indcollation[nbts_attiter_attnum - 1],
procinfo,
arg);
/* Record if any key attribute is NULL (or truncated) */
@@ -675,6 +681,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
TupleDesc itupdesc = RelationGetDescr(rel);
int keepnatts;
ScanKey scankey;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
/*
* _bt_compare() treats truncated key attributes as having the value minus
@@ -686,20 +694,22 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
scankey = itup_key->scankeys;
keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, nkeyatts)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
scankey->sk_collation,
datum1,
@@ -707,6 +717,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
break;
keepnatts++;
+ scankey++;
}
/*
@@ -747,24 +758,27 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
int keepnatts;
+ nbts_attiterdeclare(lastleft);
+ nbts_attiterdeclare(firstright);
keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
+ nbts_attiterinit(lastleft, 1, itupdesc);
+ nbts_attiterinit(firstright, 1, itupdesc);
+
+ nbts_foreachattr(1, keysz)
{
Datum datum1,
datum2;
- bool isNull1,
- isNull2;
Form_pg_attribute att;
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
+ datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
+ datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
+ att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);
- if (isNull1 != isNull2)
+ if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
break;
- if (!isNull1 &&
+ if (!nbts_attiter_curattisnull(lastleft) &&
!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
index 705da09329..cf262eee2d 100644
--- a/src/backend/utils/sort/tuplesortvariants_spec.c
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -66,47 +66,54 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
bool equal_hasnull = false;
int nkey;
int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
+ nbts_attiterdeclare(tuple1);
+ nbts_attiterdeclare(tuple2);
tuple1 = (IndexTuple) a->tuple;
tuple2 = (IndexTuple) b->tuple;
keysz = base->nKeys;
tupDes = RelationGetDescr(arg->index.indexRel);
- if (sortKey->abbrev_converter)
+ if (!sortKey->abbrev_converter)
{
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
+ nkey = 2;
+ sortKey++;
+ }
+ else
+ {
+ nkey = 1;
}
/* they are equal, so we only need to examine one null flag */
if (a->isnull1)
equal_hasnull = true;
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ nbts_attiterinit(tuple1, nkey, tupDes);
+ nbts_attiterinit(tuple2, nkey, tupDes);
+
+ nbts_foreachattr(nkey, keysz)
{
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+ Datum datum1,
+ datum2;
+ datum1 = nbts_attiter_nextattdatum(tuple1, tupDes);
+ datum2 = nbts_attiter_nextattdatum(tuple2, tupDes);
+
+ if (nbts_attiter_attnum == 1)
+ compare = ApplySortAbbrevFullComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
+ else
+ compare = ApplySortComparator(datum1, nbts_attiter_curattisnull(tuple1),
+ datum2, nbts_attiter_curattisnull(tuple2),
+ sortKey);
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
if (compare != 0)
- return compare; /* done when we find unequal attributes */
+ return compare;
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
+ if (nbts_attiter_curattisnull(tuple1))
equal_hasnull = true;
+
+ sortKey++;
}
/*
--
2.40.1
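
To make it clearer why the nbts_attiter_* form can beat a plain
index_getattr() loop, here is a stand-alone sketch of the underlying idea
(ToyAttIter, toy_iter_init and toy_iter_next are illustrative only, and
fixed-width attributes are assumed; the real iterator also deals with
alignment, varlena headers and NULLs): keep one running byte offset while
walking the attributes in order, instead of letting every index_getattr()
call recompute its offset from column 1 when attcacheoff cannot be used,
which is O(natts^2) overall.

#include <stddef.h>

typedef struct ToyAttIter
{
	const char *data;			/* start of the tuple's attribute data */
	size_t		off;			/* running offset of the next attribute */
} ToyAttIter;

static void
toy_iter_init(ToyAttIter *it, const char *tupledata)
{
	it->data = tupledata;
	it->off = 0;
}

/* Return a pointer to the next attribute and advance the running offset. */
static const char *
toy_iter_next(ToyAttIter *it, size_t attlen)
{
	const char *att = it->data + it->off;

	it->off += attlen;
	return att;
}

Each attribute is visited exactly once and its offset is derived from the
previous one in O(1), which is the same property the specialized iteration
macros give the btree code for uncacheable attribute offsets.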
v13-0002-Specialize-nbtree-functions-on-btree-key-shape.patchapplication/octet-stream; name=v13-0002-Specialize-nbtree-functions-on-btree-key-shape.patchDownload
From 1ee2045425d5a4f9c1a9473ba4a526d1fcbbec7f Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 11 Jan 2023 02:13:04 +0100
Subject: [PATCH v13 2/6] Specialize nbtree functions on btree key shape.
nbtree keys are not all made the same, so a significant amount of time is
spent on code that exists only to deal with other key shapes. By specializing
function calls based on the key shape, we can remove or reduce these sources
of overhead.
This commit adds the basic infrastructure for specializing specific hot code
in the nbtree AM to certain shapes of keys, and splits the code that can
benefit from attribute offset optimizations into separate files. This does
NOT yet update the code itself - it just makes the code compile cleanly.
The performance should be comparable if not the same.
---
contrib/amcheck/verify_nbtree.c | 6 +
src/backend/access/nbtree/README | 28 +
src/backend/access/nbtree/nbtdedup.c | 300 +----
src/backend/access/nbtree/nbtdedup_spec.c | 317 +++++
src/backend/access/nbtree/nbtinsert.c | 579 +--------
src/backend/access/nbtree/nbtinsert_spec.c | 584 +++++++++
src/backend/access/nbtree/nbtpage.c | 1 +
src/backend/access/nbtree/nbtree.c | 37 +-
src/backend/access/nbtree/nbtree_spec.c | 69 +
src/backend/access/nbtree/nbtsearch.c | 1101 +---------------
src/backend/access/nbtree/nbtsearch_spec.c | 1113 +++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 264 +---
src/backend/access/nbtree/nbtsort_spec.c | 280 +++++
src/backend/access/nbtree/nbtsplitloc.c | 3 +
src/backend/access/nbtree/nbtutils.c | 754 +----------
src/backend/access/nbtree/nbtutils_spec.c | 775 ++++++++++++
src/backend/utils/sort/tuplesortvariants.c | 156 +--
.../utils/sort/tuplesortvariants_spec.c | 175 +++
src/include/access/nbtree.h | 44 +-
src/include/access/nbtree_spec.h | 183 +++
src/include/access/nbtree_specfuncs.h | 65 +
src/tools/pginclude/cpluspluscheck | 2 +
src/tools/pginclude/headerscheck | 2 +
23 files changed, 3659 insertions(+), 3179 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup_spec.c
create mode 100644 src/backend/access/nbtree/nbtinsert_spec.c
create mode 100644 src/backend/access/nbtree/nbtree_spec.c
create mode 100644 src/backend/access/nbtree/nbtsearch_spec.c
create mode 100644 src/backend/access/nbtree/nbtsort_spec.c
create mode 100644 src/backend/access/nbtree/nbtutils_spec.c
create mode 100644 src/backend/utils/sort/tuplesortvariants_spec.c
create mode 100644 src/include/access/nbtree_spec.h
create mode 100644 src/include/access/nbtree_specfuncs.h
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index e5cab3e3a9..c8b8b0d339 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2680,6 +2680,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
BTStack stack;
Buffer lbuf;
bool exists;
+ nbts_prep_ctx(NULL);
key = _bt_mkscankey(state->rel, itup);
Assert(key->heapkeyspace && key->scantid != NULL);
@@ -2780,6 +2781,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2843,6 +2845,7 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2867,6 +2870,7 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
{
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -2906,6 +2910,7 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
ItemId itemid;
int32 cmp;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(NULL);
Assert(key->pivotsearch);
@@ -3141,6 +3146,7 @@ static inline BTScanInsert
bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
+ nbts_prep_ctx(NULL);
skey = _bt_mkscankey(rel, itup);
skey->pivotsearch = true;
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 0f10141a2f..e9d0cf6ac1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1084,6 +1084,34 @@ that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.
+Notes about nbtree specialization
+---------------------------------
+
+Attribute iteration adds significant overhead for multi-column indexes
+with variable-length attributes, due to our inability to cache the offset
+of each attribute into an on-disk tuple. To combat this, we'd have to either
+fully deserialize the tuple, or maintain our offset into the tuple as we
+iterate over the tuple's fields.
+
+Keeping track of this offset has a non-negligible overhead of its own, so
+we'd prefer not to keep track of these offsets when we can use the cache.
+By specializing performance-sensitive search functions for these specific
+index tuple shapes and calling them selectively, we keep the performance
+of cacheable attribute offsets where that is applicable, while improving
+performance where we would currently see O(n_atts^2) time iterating over
+variable-length attributes. Additionally, we update the entry points
+in the index AM to call the specialized functions, increasing the
+performance of those hot paths.
+
+Optimized code paths exist for the following cases, in order of preference:
+ - multi-column indexes that could benefit from the attcacheoff optimization
+ NB: This is also the default path, and is comparatively slow for uncacheable
+ attribute offsets.
+
+Future work will optimize for multi-column indexes that don't benefit
+from the attcacheoff optimization by improving on the O(n^2) nature of
+index_getattr through storing attribute offsets.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index d4db0b28f2..4589ade267 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -22,260 +22,14 @@
static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
TM_IndexDeleteOp *delstate);
-static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem);
static void _bt_singleval_fillfactor(Page page, BTDedupState state,
Size newitemsz);
#ifdef USE_ASSERT_CHECKING
static bool _bt_posting_valid(IndexTuple posting);
#endif
-/*
- * Perform a deduplication pass.
- *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible. Note, however, that "single
- * value" strategy is used for !bottomupdedup callers when the page is full of
- * tuples of a single value. Deduplication passes that apply the strategy
- * will leave behind a few untouched tuples at the end of the page, preparing
- * the page for an anticipated page split that uses nbtsplitloc.c's own single
- * value strategy. Our high level goal is to delay merging the untouched
- * tuples until after the page splits.
- *
- * When a call to _bt_bottomupdel_pass() just took place (and failed), our
- * high level goal is to prevent a page split entirely by buying more time.
- * We still hope that a page split can be avoided altogether. That's why
- * single value strategy is not even considered for bottomupdedup callers.
- *
- * The page will have to be split if we cannot successfully free at least
- * newitemsz (we also need space for newitem's line pointer, which isn't
- * included in caller's newitemsz).
- *
- * Note: Caller should have already deleted all existing items with their
- * LP_DEAD bits set.
- */
-void
-_bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
- bool bottomupdedup)
-{
- OffsetNumber offnum,
- minoff,
- maxoff;
- Page page = BufferGetPage(buf);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- Page newpage;
- BTDedupState state;
- Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
- bool singlevalstrat = false;
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-
- /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
- newitemsz += sizeof(ItemIdData);
-
- /*
- * Initialize deduplication state.
- *
- * It would be possible for maxpostingsize (limit on posting list tuple
- * size) to be set to one third of the page. However, it seems like a
- * good idea to limit the size of posting lists to one sixth of a page.
- * That ought to leave us with a good split point when pages full of
- * duplicates can be split several times.
- */
- state = (BTDedupState) palloc(sizeof(BTDedupStateData));
- state->deduplicate = true;
- state->nmaxitems = 0;
- state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
- /* Metadata about base tuple of current pending posting list */
- state->base = NULL;
- state->baseoff = InvalidOffsetNumber;
- state->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- state->htids = palloc(state->maxpostingsize);
- state->nhtids = 0;
- state->nitems = 0;
- /* Size of all physical tuples to be replaced by pending posting list */
- state->phystupsize = 0;
- /* nintervals should be initialized to zero */
- state->nintervals = 0;
-
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Consider applying "single value" strategy, though only if the page
- * seems likely to be split in the near future
- */
- if (!bottomupdedup)
- singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
-
- /*
- * Deduplicate items from page, and write them to newpage.
- *
- * Copy the original page's LSN into newpage copy. This will become the
- * updated version of the page. We need this because XLogInsert will
- * examine the LSN and possibly dump it in a page image.
- */
- newpage = PageGetTempPageCopySpecial(page);
- PageSetLSN(newpage, PageGetLSN(page));
-
- /* Copy high key, if any */
- if (!P_RIGHTMOST(opaque))
- {
- ItemId hitemid = PageGetItemId(page, P_HIKEY);
- Size hitemsz = ItemIdGetLength(hitemid);
- IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
-
- if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
- false, false) == InvalidOffsetNumber)
- elog(ERROR, "deduplication failed to add highkey");
- }
-
- for (offnum = minoff;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid = PageGetItemId(page, offnum);
- IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
-
- Assert(!ItemIdIsDead(itemid));
-
- if (offnum == minoff)
- {
- /*
- * No previous/base tuple for the data item -- use the data item
- * as base tuple of pending posting list
- */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
- _bt_dedup_save_htid(state, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID(s) for itup have been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list for some other reason (e.g., adding more
- * TIDs would have caused posting list to exceed current
- * maxpostingsize).
- *
- * If state contains pending posting list with more than one item,
- * form new posting tuple and add it to our temp page (newpage).
- * Else add pending interval's base tuple to the temp page as-is.
- */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- if (singlevalstrat)
- {
- /*
- * Single value strategy's extra steps.
- *
- * Lower maxpostingsize for sixth and final large posting list
- * tuple at the point where 5 maxpostingsize-capped tuples
- * have either been formed or observed.
- *
- * When a sixth maxpostingsize-capped item is formed/observed,
- * stop merging together tuples altogether. The few tuples
- * that remain at the end of the page won't be merged together
- * at all (at least not until after a future page split takes
- * place, when this page's newly allocated right sibling page
- * gets its first deduplication pass).
- */
- if (state->nmaxitems == 5)
- _bt_singleval_fillfactor(page, state, newitemsz);
- else if (state->nmaxitems == 6)
- {
- state->deduplicate = false;
- singlevalstrat = false; /* won't be back here */
- }
- }
-
- /* itup starts new pending posting list */
- _bt_dedup_start_pending(state, itup, offnum);
- }
- }
-
- /* Handle the last item */
- pagesaving += _bt_dedup_finish_pending(newpage, state);
-
- /*
- * If no items suitable for deduplication were found, newpage must be
- * exactly the same as the original page, so just return from function.
- *
- * We could determine whether or not to proceed on the basis the space
- * savings being sufficient to avoid an immediate page split instead. We
- * don't do that because there is some small value in nbtsplitloc.c always
- * operating against a page that is fully deduplicated (apart from
- * newitem). Besides, most of the cost has already been paid.
- */
- if (state->nintervals == 0)
- {
- /* cannot leak memory here */
- pfree(newpage);
- pfree(state->htids);
- pfree(state);
- return;
- }
-
- /*
- * By here, it's clear that deduplication will definitely go ahead.
- *
- * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
- * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
- * But keep things tidy.
- */
- if (P_HAS_GARBAGE(opaque))
- {
- BTPageOpaque nopaque = BTPageGetOpaque(newpage);
-
- nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
- }
-
- START_CRIT_SECTION();
-
- PageRestoreTempPage(newpage, page);
- MarkBufferDirty(buf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- XLogRecPtr recptr;
- xl_btree_dedup xlrec_dedup;
-
- xlrec_dedup.nintervals = state->nintervals;
-
- XLogBeginInsert();
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
-
- /*
- * The intervals array is not in the buffer, but pretend that it is.
- * When XLogInsert stores the whole buffer, the array need not be
- * stored too.
- */
- XLogRegisterBufData(0, (char *) state->intervals,
- state->nintervals * sizeof(BTDedupInterval));
-
- recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Local space accounting should agree with page accounting */
- Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
-
- /* cannot leak memory here */
- pfree(state->htids);
- pfree(state);
-}
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtdedup_spec.c"
+#include "access/nbtree_spec.h"
/*
* Perform bottom-up index deletion pass.
@@ -316,6 +70,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
TM_IndexDeleteOp delstate;
bool neverdedup;
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ nbts_prep_ctx(rel);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -752,55 +507,6 @@ _bt_bottomupdel_finish_pending(Page page, BTDedupState state,
state->phystupsize = 0;
}
-/*
- * Determine if page non-pivot tuples (data items) are all duplicates of the
- * same value -- if they are, deduplication's "single value" strategy should
- * be applied. The general goal of this strategy is to ensure that
- * nbtsplitloc.c (which uses its own single value strategy) will find a useful
- * split point as further duplicates are inserted, and successive rightmost
- * page splits occur among pages that store the same duplicate value. When
- * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
- * just like it would if deduplication were disabled.
- *
- * We expect that affected workloads will require _several_ single value
- * strategy deduplication passes (over a page that only stores duplicates)
- * before the page is finally split. The first deduplication pass should only
- * find regular non-pivot tuples. Later deduplication passes will find
- * existing maxpostingsize-capped posting list tuples, which must be skipped
- * over. The penultimate pass is generally the first pass that actually
- * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
- * few untouched non-pivot tuples. The final deduplication pass won't free
- * any space -- it will skip over everything without merging anything (it
- * retraces the steps of the penultimate pass).
- *
- * Fortunately, having several passes isn't too expensive. Each pass (after
- * the first pass) won't spend many cycles on the large posting list tuples
- * left by previous passes. Each pass will find a large contiguous group of
- * smaller duplicate tuples to merge together at the end of the page.
- */
-static bool
-_bt_do_singleval(Relation rel, Page page, BTDedupState state,
- OffsetNumber minoff, IndexTuple newitem)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- ItemId itemid;
- IndexTuple itup;
-
- itemid = PageGetItemId(page, minoff);
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- {
- itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
- itup = (IndexTuple) PageGetItem(page, itemid);
-
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
- return true;
- }
-
- return false;
-}
-
/*
* Lower maxpostingsize when using "single value" strategy, to avoid a sixth
* and final maxpostingsize-capped tuple. The sixth and final posting list
diff --git a/src/backend/access/nbtree/nbtdedup_spec.c b/src/backend/access/nbtree/nbtdedup_spec.c
new file mode 100644
index 0000000000..4b280de980
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup_spec.c
@@ -0,0 +1,317 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup_spec.c
+ * Index shape-specialized functions for nbtdedup.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtdedup_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_do_singleval NBTS_FUNCTION(_bt_do_singleval)
+
+static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem);
+
+/*
+ * Perform a deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible. Note, however, that "single
+ * value" strategy is used for !bottomupdedup callers when the page is full of
+ * tuples of a single value. Deduplication passes that apply the strategy
+ * will leave behind a few untouched tuples at the end of the page, preparing
+ * the page for an anticipated page split that uses nbtsplitloc.c's own single
+ * value strategy. Our high level goal is to delay merging the untouched
+ * tuples until after the page splits.
+ *
+ * When a call to _bt_bottomupdel_pass() just took place (and failed), our
+ * high level goal is to prevent a page split entirely by buying more time.
+ * We still hope that a page split can be avoided altogether. That's why
+ * single value strategy is not even considered for bottomupdedup callers.
+ *
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+void
+_bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
+ bool bottomupdedup)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ Page newpage;
+ BTDedupState state;
+ Size pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
+ bool singlevalstrat = false;
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Initialize deduplication state.
+ *
+ * It would be possible for maxpostingsize (limit on posting list tuple
+ * size) to be set to one third of the page. However, it seems like a
+ * good idea to limit the size of posting lists to one sixth of a page.
+ * That ought to leave us with a good split point when pages full of
+ * duplicates can be split several times.
+ */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->deduplicate = true;
+ state->nmaxitems = 0;
+ state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ state->htids = palloc(state->maxpostingsize);
+ state->nhtids = 0;
+ state->nitems = 0;
+ /* Size of all physical tuples to be replaced by pending posting list */
+ state->phystupsize = 0;
+ /* nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Consider applying "single value" strategy, though only if the page
+ * seems likely to be split in the near future
+ */
+ if (!bottomupdedup)
+ singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
+
+ /*
+ * Deduplicate items from page, and write them to newpage.
+ *
+ * Copy the original page's LSN into newpage copy. This will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Copy high key, if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add highkey");
+ }
+
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed current
+ * maxpostingsize).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple and add it to our temp page (newpage).
+ * Else add pending interval's base tuple to the temp page as-is.
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ if (singlevalstrat)
+ {
+ /*
+ * Single value strategy's extra steps.
+ *
+ * Lower maxpostingsize for sixth and final large posting list
+ * tuple at the point where 5 maxpostingsize-capped tuples
+ * have either been formed or observed.
+ *
+ * When a sixth maxpostingsize-capped item is formed/observed,
+ * stop merging together tuples altogether. The few tuples
+ * that remain at the end of the page won't be merged together
+ * at all (at least not until after a future page split takes
+ * place, when this page's newly allocated right sibling page
+ * gets its first deduplication pass).
+ */
+ if (state->nmaxitems == 5)
+ _bt_singleval_fillfactor(page, state, newitemsz);
+ else if (state->nmaxitems == 6)
+ {
+ state->deduplicate = false;
+ singlevalstrat = false; /* won't be back here */
+ }
+ }
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ *
+ * We could determine whether or not to proceed on the basis of the space
+ * savings being sufficient to avoid an immediate page split instead. We
+ * don't do that because there is some small value in nbtsplitloc.c always
+ * operating against a page that is fully deduplicated (apart from
+ * newitem). Besides, most of the cost has already been paid.
+ */
+ if (state->nintervals == 0)
+ {
+ /* cannot leak memory here */
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ /*
+ * By here, it's clear that deduplication will definitely go ahead.
+ *
+ * Clear the BTP_HAS_GARBAGE page flag. The index must be a heapkeyspace
+ * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+ * But keep things tidy.
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ BTPageOpaque nopaque = BTPageGetOpaque(newpage);
+
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /*
+ * The intervals array is not in the buffer, but pretend that it is.
+ * When XLogInsert stores the whole buffer, the array need not be
+ * stored too.
+ */
+ XLogRegisterBufData(0, (char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* cannot leak memory here */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied. The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value. When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split. The first deduplication pass should only
+ * find regular non-pivot tuples. Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over. The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples. The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive. Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes. Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ */
+static bool
+_bt_do_singleval(Relation rel, Page page, BTDedupState state,
+ OffsetNumber minoff, IndexTuple newitem)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ return true;
+ }
+
+ return false;
+}
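
The NBTS_FUNCTION() define at the top of nbtdedup_spec.c is the hook that lets one specializable source file be compiled once per key shape. As a rough sketch only (the helper macros and guard names below are assumptions; the real definitions live in access/nbtree_spec.h, which is not part of this excerpt), the naming scheme could look roughly like this, with the "default" and "cached" suffixes matching the btinsert_default and _bt_specialize_cached symbols that appear further down in the patch:

    /* Illustrative sketch, not patch contents: per-shape symbol naming. */
    #define NBTS_MAKE_NAME_(name, suffix)   name##_##suffix
    #define NBTS_MAKE_NAME(name, suffix)    NBTS_MAKE_NAME_(name, suffix)

    #ifdef NBTS_SPECIALIZING_CACHED         /* assumed guard name */
    #define NBTS_FUNCTION(name)     NBTS_MAKE_NAME(name, cached)
    #else
    #define NBTS_FUNCTION(name)     NBTS_MAKE_NAME(name, default)
    #endif

    /*
     * The specialization header then includes NBT_SPECIALIZE_FILE once per
     * shape, so _bt_do_singleval above compiles into _bt_do_singleval_cached,
     * _bt_do_singleval_default, and so on, each with its own attribute-access
     * code path.
     */
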
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 8f602ab2d6..4a9ed5e7b7 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,28 +30,16 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, Relation heaprel,
- BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
uint32 *speculativeToken);
-static OffsetNumber _bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel);
static void _bt_stepright(Relation rel, Relation heaprel,
BTInsertState insertstate, BTStack stack);
-static void _bt_insertonpg(Relation rel, Relation heaprel, BTScanInsert itup_key,
- Buffer buf,
- Buffer cbuf,
- BTStack stack,
- IndexTuple itup,
- Size itemsz,
- OffsetNumber newitemoff,
- int postingoff,
+static void _bt_insertonpg(Relation rel, Relation heaprel,
+ BTScanInsert itup_key, Buffer buf, Buffer cbuf,
+ BTStack stack, IndexTuple itup, Size itemsz,
+ OffsetNumber newitemoff, int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key,
Buffer buf, Buffer cbuf, OffsetNumber newitemoff,
@@ -75,313 +63,8 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
-/*
- * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
- *
- * This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
- *
- * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
- * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
- * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
- * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- * don't actually insert.
- *
- * indexUnchanged executor hint indicates if itup is from an
- * UPDATE that didn't logically change the indexed value, but
- * must nevertheless have a new entry to point to a successor
- * version.
- *
- * The result value is only significant for UNIQUE_CHECK_PARTIAL:
- * it must be true if the entry is known unique, else false.
- * (In the current implementation we'll also return true after a
- * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
- * that's just a coding artifact.)
- */
-bool
-_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel)
-{
- bool is_unique = false;
- BTInsertStateData insertstate;
- BTScanInsert itup_key;
- BTStack stack;
- bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
-
- /* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
-
- if (checkingunique)
- {
- if (!itup_key->anynullkeys)
- {
- /* No (heapkeyspace) scantid until uniqueness established */
- itup_key->scantid = NULL;
- }
- else
- {
- /*
- * Scan key for new tuple contains NULL key values. Bypass
- * checkingunique steps. They are unnecessary because core code
- * considers NULL unequal to every value, including NULL.
- *
- * This optimization avoids O(N^2) behavior within the
- * _bt_findinsertloc() heapkeyspace path when a unique index has a
- * large number of "duplicates" with NULL key values.
- */
- checkingunique = false;
- /* Tuple is unique in the sense that core code cares about */
- Assert(checkUnique != UNIQUE_CHECK_EXISTING);
- is_unique = true;
- }
- }
-
- /*
- * Fill in the BTInsertState working area, to track the current page and
- * position within the page to insert on.
- *
- * Note that itemsz is passed down to lower level code that deals with
- * inserting the item. It must be MAXALIGN()'d. This ensures that space
- * accounting code consistently considers the alignment overhead that we
- * expect PageAddItem() will add later. (Actually, index_form_tuple() is
- * already conservative about alignment, but we don't rely on that from
- * this distance. Besides, preserving the "true" tuple size in index
- * tuple headers for the benefit of nbtsplitloc.c might happen someday.
- * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
- */
- insertstate.itup = itup;
- insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
- insertstate.itup_key = itup_key;
- insertstate.bounds_valid = false;
- insertstate.buf = InvalidBuffer;
- insertstate.postingoff = 0;
-
-search:
-
- /*
- * Find and lock the leaf page that the tuple should be added to by
- * searching from the root page. insertstate.buf will hold a buffer that
- * is locked in exclusive mode afterwards.
- */
- stack = _bt_search_insert(rel, heapRel, &insertstate);
-
- /*
- * checkingunique inserts are not allowed to go ahead when two tuples with
- * equal key attribute values would be visible to new MVCC snapshots once
- * the xact commits. Check for conflicts in the locked page/buffer (if
- * needed) here.
- *
- * It might be necessary to check a page to the right in _bt_check_unique,
- * though that should be very rare. In practice the first page the value
- * could be on (with scantid omitted) is almost always also the only page
- * that a matching tuple might be found on. This is due to the behavior
- * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
- * only be allowed to cross a page boundary when there is no candidate
- * leaf page split point that avoids it. Also, _bt_check_unique can use
- * the leaf page high key to determine that there will be no duplicates on
- * the right sibling without actually visiting it (it uses the high key in
- * cases where the new item happens to belong at the far right of the leaf
- * page).
- *
- * NOTE: obviously, _bt_check_unique can only detect keys that are already
- * in the index; so it cannot defend against concurrent insertions of the
- * same key. We protect against that by means of holding a write lock on
- * the first page the value could be on, with omitted/-inf value for the
- * implicit heap TID tiebreaker attribute. Any other would-be inserter of
- * the same key must acquire a write lock on the same page, so only one
- * would-be inserter can be making the check at one time. Furthermore,
- * once we are past the check we hold write locks continuously until we
- * have performed our insertion, so no later inserter can fail to see our
- * insertion. (This requires some care in _bt_findinsertloc.)
- *
- * If we must wait for another xact, we release the lock while waiting,
- * and then must perform a new search.
- *
- * For a partial uniqueness check, we don't wait for the other xact. Just
- * let the tuple in and return false for possibly non-unique, or true for
- * definitely unique.
- */
- if (checkingunique)
- {
- TransactionId xwait;
- uint32 speculativeToken;
-
- xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
-
- if (unlikely(TransactionIdIsValid(xwait)))
- {
- /* Have to wait for the other guy ... */
- _bt_relbuf(rel, insertstate.buf);
- insertstate.buf = InvalidBuffer;
-
- /*
- * If it's a speculative insertion, wait for it to finish (ie. to
- * go ahead with the insertion, or kill the tuple). Otherwise
- * wait for the transaction to finish as usual.
- */
- if (speculativeToken)
- SpeculativeInsertionWait(xwait, speculativeToken);
- else
- XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
-
- /* start over... */
- if (stack)
- _bt_freestack(stack);
- goto search;
- }
-
- /* Uniqueness is established -- restore heap tid as scantid */
- if (itup_key->heapkeyspace)
- itup_key->scantid = &itup->t_tid;
- }
-
- if (checkUnique != UNIQUE_CHECK_EXISTING)
- {
- OffsetNumber newitemoff;
-
- /*
- * The only conflict predicate locking cares about for indexes is when
- * an index tuple insert conflicts with an existing lock. We don't
- * know the actual page we're going to insert on for sure just yet in
- * checkingunique and !heapkeyspace cases, but it's okay to use the
- * first page the value could be on (with scantid omitted) instead.
- */
- CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
-
- /*
- * Do the insertion. Note that insertstate contains cached binary
- * search bounds established within _bt_check_unique when insertion is
- * checkingunique.
- */
- newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- indexUnchanged, stack, heapRel);
- _bt_insertonpg(rel, heapRel, itup_key, insertstate.buf, InvalidBuffer,
- stack, itup, insertstate.itemsz, newitemoff,
- insertstate.postingoff, false);
- }
- else
- {
- /* just release the buffer */
- _bt_relbuf(rel, insertstate.buf);
- }
-
- /* be tidy */
- if (stack)
- _bt_freestack(stack);
- pfree(itup_key);
-
- return is_unique;
-}
-
-/*
- * _bt_search_insert() -- _bt_search() wrapper for inserts
- *
- * Search the tree for a particular scankey, or more precisely for the first
- * leaf page it could be on. Try to make use of the fastpath optimization's
- * rightmost leaf page cache before actually searching the tree from the root
- * page, though.
- *
- * Return value is a stack of parent-page pointers (though see notes about
- * fastpath optimization and page splits below). insertstate->buf is set to
- * the address of the leaf-page buffer, which is write-locked and pinned in
- * all cases (if necessary by creating a new empty root page for caller).
- *
- * The fastpath optimization avoids most of the work of searching the tree
- * repeatedly when a single backend inserts successive new tuples on the
- * rightmost leaf page of an index. A backend cache of the rightmost leaf
- * page is maintained within _bt_insertonpg(), and used here. The cache is
- * invalidated here when an insert of a non-pivot tuple must take place on a
- * non-rightmost leaf page.
- *
- * The optimization helps with indexes on an auto-incremented field. It also
- * helps with indexes on datetime columns, as well as indexes with lots of
- * NULL values. (NULLs usually get inserted in the rightmost page for single
- * column indexes, since they usually get treated as coming after everything
- * else in the key space. Individual NULL tuples will generally be placed on
- * the rightmost leaf page due to the influence of the heap TID column.)
- *
- * Note that we avoid applying the optimization when there is insufficient
- * space on the rightmost page to fit caller's new item. This is necessary
- * because we'll need to return a real descent stack when a page split is
- * expected (actually, caller can cope with a leaf page split that uses a NULL
- * stack, but that's very slow and so must be avoided). Note also that the
- * fastpath optimization acquires the lock on the page conditionally as a way
- * of reducing extra contention when there are concurrent insertions into the
- * rightmost page (we give up if we'd have to wait for the lock). We assume
- * that it isn't useful to apply the optimization when there is contention,
- * since each per-backend cache won't stay valid for long.
- */
-static BTStack
-_bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
-{
- Assert(insertstate->buf == InvalidBuffer);
- Assert(!insertstate->bounds_valid);
- Assert(insertstate->postingoff == 0);
-
- if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
- {
- /* Simulate a _bt_getbuf() call with conditional locking */
- insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
- if (_bt_conditionallockbuf(rel, insertstate->buf))
- {
- Page page;
- BTPageOpaque opaque;
- AttrNumber cmpcol = 1;
-
- _bt_checkpage(rel, insertstate->buf);
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Check if the page is still the rightmost leaf page and has
- * enough free space to accommodate the new tuple. Also check
- * that the insertion scan key is strictly greater than the first
- * non-pivot tuple on the page. (Note that we expect itup_key's
- * scantid to be unset when our caller is a checkingunique
- * inserter.)
- */
- if (P_RIGHTMOST(opaque) &&
- P_ISLEAF(opaque) &&
- !P_IGNORE(opaque) &&
- PageGetFreeSpace(page) > insertstate->itemsz &&
- PageGetMaxOffsetNumber(page) >= P_HIKEY &&
- _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
- &cmpcol) > 0)
- {
- /*
- * Caller can use the fastpath optimization because cached
- * block is still rightmost leaf page, which can fit caller's
- * new tuple without splitting. Keep block in local cache for
- * next insert, and have caller use NULL stack.
- *
- * Note that _bt_insert_parent() has an assertion that catches
- * leaf page splits that somehow follow from a fastpath insert
- * (it should only be passed a NULL stack when it must deal
- * with a concurrent root page split, and never because a NULL
- * stack was returned here).
- */
- return NULL;
- }
-
- /* Page unsuitable for caller, drop lock and pin */
- _bt_relbuf(rel, insertstate->buf);
- }
- else
- {
- /* Lock unavailable, drop pin */
- ReleaseBuffer(insertstate->buf);
- }
-
- /* Forget block, since cache doesn't appear to be useful */
- RelationSetTargetBlock(rel, InvalidBlockNumber);
- }
-
- /* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, heaprel, insertstate->itup_key, &insertstate->buf,
- BT_WRITE);
-}
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtinsert_spec.c"
+#include "access/nbtree_spec.h"
/*
* _bt_check_unique() -- Check for violation of unique index constraint
@@ -425,6 +108,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
bool inposting = false;
bool prevalldead = true;
int curposti = 0;
+ nbts_prep_ctx(rel);
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -776,253 +460,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
return InvalidTransactionId;
}
-
-/*
- * _bt_findinsertloc() -- Finds an insert location for a tuple
- *
- * On entry, insertstate buffer contains the page the new tuple belongs
- * on. It is exclusive-locked and pinned by the caller.
- *
- * If 'checkingunique' is true, the buffer on entry is the first page
- * that contains duplicates of the new key. If there are duplicates on
- * multiple pages, the correct insertion position might be some page to
- * the right, rather than the first page. In that case, this function
- * moves right to the correct target page.
- *
- * (In a !heapkeyspace index, there can be multiple pages with the same
- * high key, where the new tuple could legitimately be placed on. In
- * that case, the caller passes the first page containing duplicates,
- * just like when checkingunique=true. If that page doesn't have enough
- * room for the new tuple, this function moves right, trying to find a
- * legal page that does.)
- *
- * If 'indexUnchanged' is true, this is for an UPDATE that didn't
- * logically change the indexed value, but must nevertheless have a new
- * entry to point to a successor version. This hint from the executor
- * will influence our behavior when the page might have to be split and
- * we must consider our options. Bottom-up index deletion can avoid
- * pathological version-driven page splits, but we only want to go to the
- * trouble of trying it when we already have moderate confidence that
- * it's appropriate. The hint should not significantly affect our
- * behavior over time unless practically all inserts on to the leaf page
- * get the hint.
- *
- * On exit, insertstate buffer contains the chosen insertion page, and
- * the offset within that page is returned. If _bt_findinsertloc needed
- * to move right, the lock and pin on the original page are released, and
- * the new buffer is exclusively locked and pinned instead.
- *
- * If insertstate contains cached binary search bounds, we will take
- * advantage of them. This avoids repeating comparisons that we made in
- * _bt_check_unique() already.
- */
-static OffsetNumber
-_bt_findinsertloc(Relation rel,
- BTInsertState insertstate,
- bool checkingunique,
- bool indexUnchanged,
- BTStack stack,
- Relation heapRel)
-{
- BTScanInsert itup_key = insertstate->itup_key;
- Page page = BufferGetPage(insertstate->buf);
- BTPageOpaque opaque;
- OffsetNumber newitemoff;
-
- opaque = BTPageGetOpaque(page);
-
- /* Check 1/3 of a page restriction */
- if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
- _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
- insertstate->itup);
-
- Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
- Assert(!insertstate->bounds_valid || checkingunique);
- Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
- Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
- Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
-
- if (itup_key->heapkeyspace)
- {
- /* Keep track of whether checkingunique duplicate seen */
- bool uniquedup = indexUnchanged;
-
- /*
- * If we're inserting into a unique index, we may have to walk right
- * through leaf pages to find the one leaf page that we must insert on
- * to.
- *
- * This is needed for checkingunique callers because a scantid was not
- * used when we called _bt_search(). scantid can only be set after
- * _bt_check_unique() has checked for duplicates. The buffer
- * initially stored in insertstate->buf has the page where the first
- * duplicate key might be found, which isn't always the page that new
- * tuple belongs on. The heap TID attribute for new tuple (scantid)
- * could force us to insert on a sibling page, though that should be
- * very rare in practice.
- */
- if (checkingunique)
- {
- if (insertstate->low < insertstate->stricthigh)
- {
- /* Encountered a duplicate in _bt_check_unique() */
- Assert(insertstate->bounds_valid);
- uniquedup = true;
- }
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Does the new tuple belong on this page?
- *
- * The earlier _bt_check_unique() call may well have
- * established a strict upper bound on the offset for the new
- * item. If it's not the last item of the page (i.e. if there
- * is at least one tuple on the page that goes after the tuple
- * we're inserting) then we know that the tuple belongs on
- * this page. We can skip the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- /* Test '<=', not '!=', since scantid is set now */
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
- break;
-
- _bt_stepright(rel, heapRel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- /* Assume duplicates (if checkingunique) */
- uniquedup = true;
- }
- }
-
- /*
- * If the target page cannot fit newitem, try to avoid splitting the
- * page on insert by performing deletion or deduplication now
- */
- if (PageGetFreeSpace(page) < insertstate->itemsz)
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
- checkingunique, uniquedup,
- indexUnchanged);
- }
- else
- {
- /*----------
- * This is a !heapkeyspace (version 2 or 3) index. The current page
- * is the first page that we could insert the new tuple to, but there
- * may be other pages to the right that we could opt to use instead.
- *
- * If the new key is equal to one or more existing keys, we can
- * legitimately place it anywhere in the series of equal keys. In
- * fact, if the new key is equal to the page's "high key" we can place
- * it on the next page. If it is equal to the high key, and there's
- * not room to insert the new tuple on the current page without
- * splitting, then we move right hoping to find more free space and
- * avoid a split.
- *
- * Keep scanning right until we
- * (a) find a page with enough free space,
- * (b) reach the last page where the tuple can legally go, or
- * (c) get tired of searching.
- * (c) is not flippant; it is important because if there are many
- * pages' worth of equal keys, it's better to split one of the early
- * pages than to scan all the way to the end of the run of equal keys
- * on every insert. We implement "get tired" as a random choice,
- * since stopping after scanning a fixed number of pages wouldn't work
- * well (we'd never reach the right-hand side of previously split
- * pages). The probability of moving right is set at 0.99, which may
- * seem too high to change the behavior much, but it does an excellent
- * job of preventing O(N^2) behavior with many equal keys.
- *----------
- */
- while (PageGetFreeSpace(page) < insertstate->itemsz)
- {
- AttrNumber cmpcol = 1;
-
- /*
- * Before considering moving right, see if we can obtain enough
- * space by erasing LP_DEAD items
- */
- if (P_HAS_GARBAGE(opaque))
- {
- /* Perform simple deletion */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- if (PageGetFreeSpace(page) >= insertstate->itemsz)
- break; /* OK, now we have enough space */
- }
-
- /*
- * Nope, so check conditions (b) and (c) enumerated above
- *
- * The earlier _bt_check_unique() call may well have established a
- * strict upper bound on the offset for the new item. If it's not
- * the last item of the page (i.e. if there is at least one tuple
- * on the page that's greater than the tuple we're inserting to)
- * then we know that the tuple belongs on this page. We can skip
- * the high key check.
- */
- if (insertstate->bounds_valid &&
- insertstate->low <= insertstate->stricthigh &&
- insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
- break;
-
- if (P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
- pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
- break;
-
- _bt_stepright(rel, heapRel, insertstate, stack);
- /* Update local state after stepping right */
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
- }
- }
-
- /*
- * We should now be on the correct page. Find the offset within the page
- * for the new tuple. (Possibly reusing earlier search bounds.)
- */
- {
- AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
- Assert(P_RIGHTMOST(opaque) ||
- _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
- }
-
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
-
- if (insertstate->postingoff == -1)
- {
- /*
- * There is an overlapping posting list tuple with its LP_DEAD bit
- * set. We don't want to unnecessarily unset its LP_DEAD bit while
- * performing a posting list split, so perform simple index tuple
- * deletion early.
- */
- _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
- false, false, false);
-
- /*
- * Do new binary search. New insert location cannot overlap with any
- * posting list now.
- */
- Assert(!insertstate->bounds_valid);
- insertstate->postingoff = 0;
- newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
- Assert(insertstate->postingoff == 0);
- }
-
- return newitemoff;
-}
-
/*
* Step right to next non-dead page, during insertion.
*
@@ -1506,6 +943,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
bool newitemonleft,
isleaf,
isrightmost;
+ nbts_prep_ctx(rel);
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -2706,6 +2144,7 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(buffer);
BTPageOpaque opaque = BTPageGetOpaque(page);
+ nbts_prep_ctx(rel);
Assert(P_ISLEAF(opaque));
Assert(simpleonly || itup_key->heapkeyspace);
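
The nbts_prep_ctx(rel) statements added to _bt_check_unique(), _bt_split() and _bt_delete_or_dedup_one_page() presumably set up the key-shape context that the specialized attribute-iteration macros dispatch on. A minimal sketch of what that could expand to, reusing the NBTS_CTX_CACHED/NBTS_CTX_DEFAULT values visible in nbtree_spec.c below (the helper function and its selection condition are placeholders, not the patch's actual logic):

    typedef enum NBTS_CTX
    {
        NBTS_CTX_DEFAULT,           /* generic, uncached attribute access */
        NBTS_CTX_CACHED             /* key columns with usable attcacheoff */
    } NBTS_CTX;

    /* hypothetical helper: classify the index key shape once per function */
    static inline NBTS_CTX
    _nbts_context_for(Relation rel)
    {
        /* placeholder condition; the real test would inspect the key shape */
        if (IndexRelationGetNumberOfKeyAttributes(rel) == 1)
            return NBTS_CTX_CACHED;
        return NBTS_CTX_DEFAULT;
    }

    /* declares the __nbts_ctx variable that specialized call sites switch on */
    #define nbts_prep_ctx(rel)  NBTS_CTX __nbts_ctx = _nbts_context_for(rel)
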
diff --git a/src/backend/access/nbtree/nbtinsert_spec.c b/src/backend/access/nbtree/nbtinsert_spec.c
new file mode 100644
index 0000000000..969da194fd
--- /dev/null
+++ b/src/backend/access/nbtree/nbtinsert_spec.c
@@ -0,0 +1,584 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtinsert_spec.c
+ * Index shape-specialized functions for nbtinsert.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtinsert_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_search_insert NBTS_FUNCTION(_bt_search_insert)
+#define _bt_findinsertloc NBTS_FUNCTION(_bt_findinsertloc)
+
+static BTStack _bt_search_insert(Relation rel, Relation heaprel,
+ BTInsertState insertstate);
+static OffsetNumber _bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel);
+
+
+/*
+ * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
+ *
+ * This routine is called by the public interface routine, btinsert.
+ * By here, itup is filled in, including the TID.
+ *
+ * If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
+ * will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
+ * UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
+ * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
+ * don't actually insert.
+ *
+ * indexUnchanged executor hint indicates if itup is from an
+ * UPDATE that didn't logically change the indexed value, but
+ * must nevertheless have a new entry to point to a successor
+ * version.
+ *
+ * The result value is only significant for UNIQUE_CHECK_PARTIAL:
+ * it must be true if the entry is known unique, else false.
+ * (In the current implementation we'll also return true after a
+ * successful UNIQUE_CHECK_YES or UNIQUE_CHECK_EXISTING call, but
+ * that's just a coding artifact.)
+ */
+bool
+_bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel)
+{
+ bool is_unique = false;
+ BTInsertStateData insertstate;
+ BTScanInsert itup_key;
+ BTStack stack;
+ bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+
+ /* we need an insertion scan key to do our search, so build one */
+ itup_key = _bt_mkscankey(rel, itup);
+
+ if (checkingunique)
+ {
+ if (!itup_key->anynullkeys)
+ {
+ /* No (heapkeyspace) scantid until uniqueness established */
+ itup_key->scantid = NULL;
+ }
+ else
+ {
+ /*
+ * Scan key for new tuple contains NULL key values. Bypass
+ * checkingunique steps. They are unnecessary because core code
+ * considers NULL unequal to every value, including NULL.
+ *
+ * This optimization avoids O(N^2) behavior within the
+ * _bt_findinsertloc() heapkeyspace path when a unique index has a
+ * large number of "duplicates" with NULL key values.
+ */
+ checkingunique = false;
+ /* Tuple is unique in the sense that core code cares about */
+ Assert(checkUnique != UNIQUE_CHECK_EXISTING);
+ is_unique = true;
+ }
+ }
+
+ /*
+ * Fill in the BTInsertState working area, to track the current page and
+ * position within the page to insert on.
+ *
+ * Note that itemsz is passed down to lower level code that deals with
+ * inserting the item. It must be MAXALIGN()'d. This ensures that space
+ * accounting code consistently considers the alignment overhead that we
+ * expect PageAddItem() will add later. (Actually, index_form_tuple() is
+ * already conservative about alignment, but we don't rely on that from
+ * this distance. Besides, preserving the "true" tuple size in index
+ * tuple headers for the benefit of nbtsplitloc.c might happen someday.
+ * Note that heapam does not MAXALIGN() each heap tuple's lp_len field.)
+ */
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search:
+
+ /*
+ * Find and lock the leaf page that the tuple should be added to by
+ * searching from the root page. insertstate.buf will hold a buffer that
+ * is locked in exclusive mode afterwards.
+ */
+ stack = _bt_search_insert(rel, heapRel, &insertstate);
+
+ /*
+ * checkingunique inserts are not allowed to go ahead when two tuples with
+ * equal key attribute values would be visible to new MVCC snapshots once
+ * the xact commits. Check for conflicts in the locked page/buffer (if
+ * needed) here.
+ *
+ * It might be necessary to check a page to the right in _bt_check_unique,
+ * though that should be very rare. In practice the first page the value
+ * could be on (with scantid omitted) is almost always also the only page
+ * that a matching tuple might be found on. This is due to the behavior
+ * of _bt_findsplitloc with duplicate tuples -- a group of duplicates can
+ * only be allowed to cross a page boundary when there is no candidate
+ * leaf page split point that avoids it. Also, _bt_check_unique can use
+ * the leaf page high key to determine that there will be no duplicates on
+ * the right sibling without actually visiting it (it uses the high key in
+ * cases where the new item happens to belong at the far right of the leaf
+ * page).
+ *
+ * NOTE: obviously, _bt_check_unique can only detect keys that are already
+ * in the index; so it cannot defend against concurrent insertions of the
+ * same key. We protect against that by means of holding a write lock on
+ * the first page the value could be on, with omitted/-inf value for the
+ * implicit heap TID tiebreaker attribute. Any other would-be inserter of
+ * the same key must acquire a write lock on the same page, so only one
+ * would-be inserter can be making the check at one time. Furthermore,
+ * once we are past the check we hold write locks continuously until we
+ * have performed our insertion, so no later inserter can fail to see our
+ * insertion. (This requires some care in _bt_findinsertloc.)
+ *
+ * If we must wait for another xact, we release the lock while waiting,
+ * and then must perform a new search.
+ *
+ * For a partial uniqueness check, we don't wait for the other xact. Just
+ * let the tuple in and return false for possibly non-unique, or true for
+ * definitely unique.
+ */
+ if (checkingunique)
+ {
+ TransactionId xwait;
+ uint32 speculativeToken;
+
+ xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+ &is_unique, &speculativeToken);
+
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+
+ /*
+ * If it's a speculative insertion, wait for it to finish (ie. to
+ * go ahead with the insertion, or kill the tuple). Otherwise
+ * wait for the transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search;
+ }
+
+ /* Uniqueness is established -- restore heap tid as scantid */
+ if (itup_key->heapkeyspace)
+ itup_key->scantid = &itup->t_tid;
+ }
+
+ if (checkUnique != UNIQUE_CHECK_EXISTING)
+ {
+ OffsetNumber newitemoff;
+
+ /*
+ * The only conflict predicate locking cares about for indexes is when
+ * an index tuple insert conflicts with an existing lock. We don't
+ * know the actual page we're going to insert on for sure just yet in
+ * checkingunique and !heapkeyspace cases, but it's okay to use the
+ * first page the value could be on (with scantid omitted) instead.
+ */
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
+
+ /*
+ * Do the insertion. Note that insertstate contains cached binary
+ * search bounds established within _bt_check_unique when insertion is
+ * checkingunique.
+ */
+ newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
+ indexUnchanged, stack, heapRel);
+ _bt_insertonpg(rel, heapRel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, insertstate.itemsz, newitemoff,
+ insertstate.postingoff, false);
+ }
+ else
+ {
+ /* just release the buffer */
+ _bt_relbuf(rel, insertstate.buf);
+ }
+
+ /* be tidy */
+ if (stack)
+ _bt_freestack(stack);
+ pfree(itup_key);
+
+ return is_unique;
+}
+
+/*
+ * _bt_search_insert() -- _bt_search() wrapper for inserts
+ *
+ * Search the tree for a particular scankey, or more precisely for the first
+ * leaf page it could be on. Try to make use of the fastpath optimization's
+ * rightmost leaf page cache before actually searching the tree from the root
+ * page, though.
+ *
+ * Return value is a stack of parent-page pointers (though see notes about
+ * fastpath optimization and page splits below). insertstate->buf is set to
+ * the address of the leaf-page buffer, which is write-locked and pinned in
+ * all cases (if necessary by creating a new empty root page for caller).
+ *
+ * The fastpath optimization avoids most of the work of searching the tree
+ * repeatedly when a single backend inserts successive new tuples on the
+ * rightmost leaf page of an index. A backend cache of the rightmost leaf
+ * page is maintained within _bt_insertonpg(), and used here. The cache is
+ * invalidated here when an insert of a non-pivot tuple must take place on a
+ * non-rightmost leaf page.
+ *
+ * The optimization helps with indexes on an auto-incremented field. It also
+ * helps with indexes on datetime columns, as well as indexes with lots of
+ * NULL values. (NULLs usually get inserted in the rightmost page for single
+ * column indexes, since they usually get treated as coming after everything
+ * else in the key space. Individual NULL tuples will generally be placed on
+ * the rightmost leaf page due to the influence of the heap TID column.)
+ *
+ * Note that we avoid applying the optimization when there is insufficient
+ * space on the rightmost page to fit caller's new item. This is necessary
+ * because we'll need to return a real descent stack when a page split is
+ * expected (actually, caller can cope with a leaf page split that uses a NULL
+ * stack, but that's very slow and so must be avoided). Note also that the
+ * fastpath optimization acquires the lock on the page conditionally as a way
+ * of reducing extra contention when there are concurrent insertions into the
+ * rightmost page (we give up if we'd have to wait for the lock). We assume
+ * that it isn't useful to apply the optimization when there is contention,
+ * since each per-backend cache won't stay valid for long.
+ */
+static BTStack
+_bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
+{
+ Assert(insertstate->buf == InvalidBuffer);
+ Assert(!insertstate->bounds_valid);
+ Assert(insertstate->postingoff == 0);
+
+ if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
+ {
+ /* Simulate a _bt_getbuf() call with conditional locking */
+ insertstate->buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
+ if (_bt_conditionallockbuf(rel, insertstate->buf))
+ {
+ Page page;
+ BTPageOpaque opaque;
+ AttrNumber cmpcol = 1;
+
+ _bt_checkpage(rel, insertstate->buf);
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ /*
+ * Check if the page is still the rightmost leaf page and has
+ * enough free space to accommodate the new tuple. Also check
+ * that the insertion scan key is strictly greater than the first
+ * non-pivot tuple on the page. (Note that we expect itup_key's
+ * scantid to be unset when our caller is a checkingunique
+ * inserter.)
+ */
+ if (P_RIGHTMOST(opaque) &&
+ P_ISLEAF(opaque) &&
+ !P_IGNORE(opaque) &&
+ PageGetFreeSpace(page) > insertstate->itemsz &&
+ PageGetMaxOffsetNumber(page) >= P_HIKEY &&
+ _bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+ &cmpcol) > 0)
+ {
+ /*
+ * Caller can use the fastpath optimization because cached
+ * block is still rightmost leaf page, which can fit caller's
+ * new tuple without splitting. Keep block in local cache for
+ * next insert, and have caller use NULL stack.
+ *
+ * Note that _bt_insert_parent() has an assertion that catches
+ * leaf page splits that somehow follow from a fastpath insert
+ * (it should only be passed a NULL stack when it must deal
+ * with a concurrent root page split, and never because a NULL
+ * stack was returned here).
+ */
+ return NULL;
+ }
+
+ /* Page unsuitable for caller, drop lock and pin */
+ _bt_relbuf(rel, insertstate->buf);
+ }
+ else
+ {
+ /* Lock unavailable, drop pin */
+ ReleaseBuffer(insertstate->buf);
+ }
+
+ /* Forget block, since cache doesn't appear to be useful */
+ RelationSetTargetBlock(rel, InvalidBlockNumber);
+ }
+
+ /* Cannot use optimization -- descend tree, return proper descent stack */
+ return _bt_search(rel, heaprel, insertstate->itup_key, &insertstate->buf,
+ BT_WRITE);
+}
+
+/*
+ * _bt_findinsertloc() -- Finds an insert location for a tuple
+ *
+ * On entry, insertstate buffer contains the page the new tuple belongs
+ * on. It is exclusive-locked and pinned by the caller.
+ *
+ * If 'checkingunique' is true, the buffer on entry is the first page
+ * that contains duplicates of the new key. If there are duplicates on
+ * multiple pages, the correct insertion position might be some page to
+ * the right, rather than the first page. In that case, this function
+ * moves right to the correct target page.
+ *
+ * (In a !heapkeyspace index, there can be multiple pages with the same
+ * high key, where the new tuple could legitimately be placed on. In
+ * that case, the caller passes the first page containing duplicates,
+ * just like when checkingunique=true. If that page doesn't have enough
+ * room for the new tuple, this function moves right, trying to find a
+ * legal page that does.)
+ *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts on to the leaf page
+ * get the hint.
+ *
+ * On exit, insertstate buffer contains the chosen insertion page, and
+ * the offset within that page is returned. If _bt_findinsertloc needed
+ * to move right, the lock and pin on the original page are released, and
+ * the new buffer is exclusively locked and pinned instead.
+ *
+ * If insertstate contains cached binary search bounds, we will take
+ * advantage of them. This avoids repeating comparisons that we made in
+ * _bt_check_unique() already.
+ */
+static OffsetNumber
+_bt_findinsertloc(Relation rel,
+ BTInsertState insertstate,
+ bool checkingunique,
+ bool indexUnchanged,
+ BTStack stack,
+ Relation heapRel)
+{
+ BTScanInsert itup_key = insertstate->itup_key;
+ Page page = BufferGetPage(insertstate->buf);
+ BTPageOpaque opaque;
+ OffsetNumber newitemoff;
+
+ opaque = BTPageGetOpaque(page);
+
+ /* Check 1/3 of a page restriction */
+ if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+ _bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+ insertstate->itup);
+
+ Assert(P_ISLEAF(opaque) && !P_INCOMPLETE_SPLIT(opaque));
+ Assert(!insertstate->bounds_valid || checkingunique);
+ Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+ Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
+
+ if (itup_key->heapkeyspace)
+ {
+ /* Keep track of whether checkingunique duplicate seen */
+ bool uniquedup = indexUnchanged;
+
+ /*
+ * If we're inserting into a unique index, we may have to walk right
+ * through leaf pages to find the one leaf page that we must insert on
+ * to.
+ *
+ * This is needed for checkingunique callers because a scantid was not
+ * used when we called _bt_search(). scantid can only be set after
+ * _bt_check_unique() has checked for duplicates. The buffer
+ * initially stored in insertstate->buf has the page where the first
+ * duplicate key might be found, which isn't always the page that new
+ * tuple belongs on. The heap TID attribute for new tuple (scantid)
+ * could force us to insert on a sibling page, though that should be
+ * very rare in practice.
+ */
+ if (checkingunique)
+ {
+ if (insertstate->low < insertstate->stricthigh)
+ {
+ /* Encountered a duplicate in _bt_check_unique() */
+ Assert(insertstate->bounds_valid);
+ uniquedup = true;
+ }
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Does the new tuple belong on this page?
+ *
+ * The earlier _bt_check_unique() call may well have
+ * established a strict upper bound on the offset for the new
+ * item. If it's not the last item of the page (i.e. if there
+ * is at least one tuple on the page that goes after the tuple
+ * we're inserting) then we know that the tuple belongs on
+ * this page. We can skip the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ /* Test '<=', not '!=', since scantid is set now */
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0)
+ break;
+
+ _bt_stepright(rel, heapRel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ /* Assume duplicates (if checkingunique) */
+ uniquedup = true;
+ }
+ }
+
+ /*
+ * If the target page cannot fit newitem, try to avoid splitting the
+ * page on insert by performing deletion or deduplication now
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+ checkingunique, uniquedup,
+ indexUnchanged);
+ }
+ else
+ {
+ /*----------
+ * This is a !heapkeyspace (version 2 or 3) index. The current page
+ * is the first page that we could insert the new tuple to, but there
+ * may be other pages to the right that we could opt to use instead.
+ *
+ * If the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys. In
+ * fact, if the new key is equal to the page's "high key" we can place
+ * it on the next page. If it is equal to the high key, and there's
+ * not room to insert the new tuple on the current page without
+ * splitting, then we move right hoping to find more free space and
+ * avoid a split.
+ *
+ * Keep scanning right until we
+ * (a) find a page with enough free space,
+ * (b) reach the last page where the tuple can legally go, or
+ * (c) get tired of searching.
+ * (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early
+ * pages than to scan all the way to the end of the run of equal keys
+ * on every insert. We implement "get tired" as a random choice,
+ * since stopping after scanning a fixed number of pages wouldn't work
+ * well (we'd never reach the right-hand side of previously split
+ * pages). The probability of moving right is set at 0.99, which may
+ * seem too high to change the behavior much, but it does an excellent
+ * job of preventing O(N^2) behavior with many equal keys.
+ *----------
+ */
+ while (PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ AttrNumber cmpcol = 1;
+
+ /*
+ * Before considering moving right, see if we can obtain enough
+ * space by erasing LP_DEAD items
+ */
+ if (P_HAS_GARBAGE(opaque))
+ {
+ /* Perform simple deletion */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ if (PageGetFreeSpace(page) >= insertstate->itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
+ * Nope, so check conditions (b) and (c) enumerated above
+ *
+ * The earlier _bt_check_unique() call may well have established a
+ * strict upper bound on the offset for the new item. If it's not
+ * the last item of the page (i.e. if there is at least one tuple
+ * on the page that's greater than the tuple we're inserting to)
+ * then we know that the tuple belongs on this page. We can skip
+ * the high key check.
+ */
+ if (insertstate->bounds_valid &&
+ insertstate->low <= insertstate->stricthigh &&
+ insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+ break;
+
+ if (P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) != 0 ||
+ pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+ break;
+
+ _bt_stepright(rel, heapRel, insertstate, stack);
+ /* Update local state after stepping right */
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+ }
+ }
+
+ /*
+ * We should now be on the correct page. Find the offset within the page
+ * for the new tuple. (Possibly reusing earlier search bounds.)
+ */
+ {
+ AttrNumber cmpcol PG_USED_FOR_ASSERTS_ONLY = 1;
+ Assert(P_RIGHTMOST(opaque) ||
+ _bt_compare(rel, itup_key, page, P_HIKEY, &cmpcol) <= 0);
+ }
+
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+
+ if (insertstate->postingoff == -1)
+ {
+ /*
+ * There is an overlapping posting list tuple with its LP_DEAD bit
+ * set. We don't want to unnecessarily unset its LP_DEAD bit while
+ * performing a posting list split, so perform simple index tuple
+ * deletion early.
+ */
+ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+ false, false, false);
+
+ /*
+ * Do new binary search. New insert location cannot overlap with any
+ * posting list now.
+ */
+ Assert(!insertstate->bounds_valid);
+ insertstate->postingoff = 0;
+ newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
+ Assert(insertstate->postingoff == 0);
+ }
+
+ return newitemoff;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index b7660a459e..493f684e83 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1810,6 +1810,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
bool rightsib_empty;
Page page;
BTPageOpaque opaque;
+ nbts_prep_ctx(rel);
/*
* Save original leafbuf block number from caller. Only deleted blocks
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 62bc9917f1..58f2fdba18 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -87,6 +87,8 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtree_spec.c"
+#include "access/nbtree_spec.h"
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -121,7 +123,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambuild = btbuild;
amroutine->ambuildempty = btbuildempty;
- amroutine->aminsert = btinsert;
+ amroutine->aminsert = btinsert_default;
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
@@ -155,6 +157,8 @@ btbuildempty(Relation index)
Buffer metabuf;
Page metapage;
+ nbt_opt_specialize(index);
+
/*
	 * Initialize the metapage.
*
@@ -180,33 +184,6 @@ btbuildempty(Relation index)
ReleaseBuffer(metabuf);
}
-/*
- * btinsert() -- insert an index tuple into a btree.
- *
- * Descend the tree recursively, find the appropriate location for our
- * new tuple, and put it there.
- */
-bool
-btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- IndexInfo *indexInfo)
-{
- bool result;
- IndexTuple itup;
-
- /* generate an index tuple */
- itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
- itup->t_tid = *ht_ctid;
-
- result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
-
- pfree(itup);
-
- return result;
-}
-
/*
* btgettuple() -- Get the next tuple in the scan.
*/
@@ -348,6 +325,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
IndexScanDesc scan;
BTScanOpaque so;
+ nbt_opt_specialize(rel);
+
/* no order by operators allowed */
Assert(norderbys == 0);
@@ -791,6 +770,8 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTCycleId cycleid;
+ nbt_opt_specialize(rel);
+
/* allocate stats if first time through, else re-use existing struct */
if (stats == NULL)
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
new file mode 100644
index 0000000000..6b766581ab
--- /dev/null
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.c
+ * Index shape-specialized functions for nbtree.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtree_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+/*
+ * _bt_specialize() -- Specialize this index relation for its index key.
+ */
+void
+_bt_specialize(Relation rel)
+{
+#ifdef NBTS_SPECIALIZING_DEFAULT
+ NBTS_MAKE_CTX(rel);
+ /*
+ * We can't directly address _bt_specialize here because it'd be macro-
+ * expanded, nor can we utilize NBTS_SPECIALIZE_NAME here because it'd
+ * try to call _bt_specialize, which would be an infinite recursive call.
+ */
+ switch (__nbts_ctx) {
+ case NBTS_CTX_CACHED:
+ _bt_specialize_cached(rel);
+ break;
+ case NBTS_CTX_DEFAULT:
+ break;
+ }
+#else
+ rel->rd_indam->aminsert = btinsert;
+#endif
+}
+
+/*
+ * btinsert() -- insert an index tuple into a btree.
+ *
+ * Descend the tree recursively, find the appropriate location for our
+ * new tuple, and put it there.
+ */
+bool
+btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ IndexInfo *indexInfo)
+{
+ bool result;
+ IndexTuple itup;
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+ itup->t_tid = *ht_ctid;
+
+ result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
+
+ pfree(itup);
+
+ return result;
+}
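
In plain C terms, what nbt_opt_specialize() and _bt_specialize() amount to is: classify the index's key shape once, then swap the aminsert-style callback in the relation's AM routine to the matching variant, so every later call goes straight to the specialized code. Below is a minimal self-contained sketch of that pointer swap. The type and function names are stand-ins invented for the illustration (the real code swaps rel->rd_indam->aminsert between btinsert_default and the generated btinsert variants), and the shape test shown is just a stand-in for the patchset's actual key-shape classification.

#include <stdio.h>

typedef int (*insert_fn) (const char *tuple);

static int
insert_single_key(const char *tuple)
{
	printf("single key column path: %s\n", tuple);
	return 1;
}

static int
insert_cached_offsets(const char *tuple)
{
	printf("cached offsets path: %s\n", tuple);
	return 1;
}

static int
insert_unspecialized(const char *tuple)
{
	printf("fallback path: %s\n", tuple);
	return 1;
}

typedef struct am_routine
{
	insert_fn	aminsert;		/* stand-in for rd_indam->aminsert */
} am_routine;

/* conceptual equivalent of _bt_specialize() picking a variant */
static void
specialize(am_routine *am, int nkeyatts, int offsets_cacheable)
{
	if (nkeyatts == 1)
		am->aminsert = insert_single_key;
	else if (offsets_cacheable)
		am->aminsert = insert_cached_offsets;
	else
		am->aminsert = insert_unspecialized;
}

int
main(void)
{
	am_routine	am = {insert_unspecialized};

	/* the first AM entry point that sees the index triggers the swap... */
	specialize(&am, 2, 1);
	/* ...and later inserts dispatch directly to the chosen variant */
	am.aminsert("(42, 'foo')");
	return 0;
}
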
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 44a09ced98..d4604e7f99 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,12 +26,8 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
- AttrNumber *highkeycmpcol);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -48,6 +44,8 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsearch_spec.c"
+#include "access/nbtree_spec.h"
/*
* _bt_drop_lock_and_maybe_pin()
@@ -72,591 +70,6 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
}
}
-/*
- * _bt_search() -- Search the tree for a particular scankey,
- * or more precisely for the first leaf page it could be on.
- *
- * The passed scankey is an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * Return value is a stack of parent-page pointers (i.e. there is no entry for
- * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
- * which is locked and pinned. No locks are held on the parent pages,
- * however!
- *
- * The returned buffer is locked according to access parameter. Additionally,
- * access = BT_WRITE will allow an empty root page to be created and returned.
- * When access = BT_READ, an empty index will result in *bufP being set to
- * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
- * during the search will be finished.
- *
- * heaprel must be provided by callers that pass access = BT_WRITE, since we
- * might need to allocate a new root page for caller -- see _bt_allocbuf.
- */
-BTStack
-_bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
- int access)
-{
- BTStack stack_in = NULL;
- int page_access = BT_READ;
- char tupdatabuf[BLCKSZ / 3];
- AttrNumber highkeycmpcol = 1;
-
- /* heaprel must be set whenever _bt_allocbuf is reachable */
- Assert(access == BT_READ || access == BT_WRITE);
- Assert(access == BT_READ || heaprel != NULL);
-
- /* Get the root page to start with */
- *bufP = _bt_getroot(rel, heaprel, access);
-
- /* If index is empty and access = BT_READ, no root page is created. */
- if (!BufferIsValid(*bufP))
- return (BTStack) NULL;
-
- /* Loop iterates once per level descended in the tree */
- for (;;)
- {
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- BlockNumber child;
- BTStack new_stack;
-
- /*
- * Race -- the page we just grabbed may have split since we read its
- * downlink in its parent page (or the metapage). If it has, we may
- * need to move right to its new sibling. Do that.
- *
- * In write-mode, allow _bt_moveright to finish any incomplete splits
- * along the way. Strictly speaking, we'd only need to finish an
- * incomplete split on the leaf page we're about to insert to, not on
- * any of the upper levels (internal pages with incomplete splits are
- * also taken care of in _bt_getstackbuf). But this is a good
- * opportunity to finish splits of internal pages too.
- */
- *bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
- stack_in, page_access, &highkeycmpcol,
- (char *) tupdatabuf);
-
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = BTPageGetOpaque(page);
- if (P_ISLEAF(opaque))
- break;
-
- /*
- * Find the appropriate pivot tuple on this page. Its downlink points
- * to the child page that we're about to descend to.
- */
- offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
- child = BTreeTupleGetDownLink(itup);
-
- Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
- memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
-
- /*
- * We need to save the location of the pivot tuple we chose in a new
- * stack entry for this page/level. If caller ends up splitting a
- * page one level down, it usually ends up inserting a new pivot
- * tuple/downlink immediately after the location recorded here.
- */
- new_stack = (BTStack) palloc(sizeof(BTStackData));
- new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
- new_stack->bts_offset = offnum;
- new_stack->bts_parent = stack_in;
-
- /*
- * Page level 1 is lowest non-leaf page level prior to leaves. So, if
- * we're on the level 1 and asked to lock leaf page in write mode,
- * then lock next page in write mode, because it must be a leaf.
- */
- if (opaque->btpo_level == 1 && access == BT_WRITE)
- page_access = BT_WRITE;
-
- /* drop the read lock on the page, then acquire one on its child */
- *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-
- /* okay, all set to move down a level */
- stack_in = new_stack;
- }
-
- /*
- * If we're asked to lock leaf in write mode, but didn't manage to, then
- * relock. This should only happen when the root page is a leaf page (and
- * the only page in the index other than the metapage).
- */
- if (access == BT_WRITE && page_access == BT_READ)
- {
- highkeycmpcol = 1;
-
- /* trade in our read lock for a write lock */
- _bt_unlockbuf(rel, *bufP);
- _bt_lockbuf(rel, *bufP, BT_WRITE);
-
- /*
- * Race -- the leaf page may have split after we dropped the read lock
- * but before we acquired a write lock. If it has, we may need to
- * move right to its new sibling. Do that.
- */
- *bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
- &highkeycmpcol, (char *) tupdatabuf);
- }
-
- return stack_in;
-}
-
-/*
- * _bt_moveright() -- move right in the btree if necessary.
- *
- * When we follow a pointer to reach a page, it is possible that
- * the page has changed in the meanwhile. If this happens, we're
- * guaranteed that the page has "split right" -- that is, that any
- * data that appeared on the page originally is either on the page
- * or strictly to the right of it.
- *
- * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page. If that entry is
- * strictly less than the scankey, or <= the scankey in the
- * key.nextkey=true case, then we followed the wrong link and we need
- * to move right.
- *
- * The passed insertion-type scankey can omit the rightmost column(s) of the
- * index. (see nbtree/README)
- *
- * When key.nextkey is false (the usual case), we are looking for the first
- * item >= key. When key.nextkey is true, we are looking for the first item
- * strictly greater than key.
- *
- * If forupdate is true, we will attempt to finish any incomplete splits
- * that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split is
- * completed. 'heaprel' and 'stack' are only used if forupdate is true.
- *
- * On entry, we have the buffer pinned and a lock of the type specified by
- * 'access'. If we move right, we release the buffer and lock and acquire
- * the same on the right sibling. Return value is the buffer we stop at.
- */
-Buffer
-_bt_moveright(Relation rel,
- Relation heaprel,
- BTScanInsert key,
- Buffer buf,
- bool forupdate,
- BTStack stack,
- int access,
- AttrNumber *comparecol,
- char *tupdatabuf)
-{
- Page page;
- BTPageOpaque opaque;
- int32 cmpval;
-
- Assert(!forupdate || heaprel != NULL);
- Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
-
- /*
- * When nextkey = false (normal case): if the scan key that brought us to
- * this page is > the high key stored on the page, then the page has split
- * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
- * have some duplicates to the right as well as the left, but that's
- * something that's only ever dealt with on the leaf level, after
- * _bt_search has found an initial leaf page.)
- *
- * When nextkey = true: move right if the scan key is >= page's high key.
- * (Note that key.scantid cannot be set in this case.)
- *
- * The page could even have split more than once, so scan as far as
- * needed.
- *
- * We also have to move right if we followed a link that brought us to a
- * dead page.
- */
- cmpval = key->nextkey ? 0 : 1;
-
- for (;;)
- {
- AttrNumber cmpcol = 1;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- if (P_RIGHTMOST(opaque))
- {
- *comparecol = 1;
- break;
- }
-
- /*
- * Finish any incomplete splits we encounter along the way.
- */
- if (forupdate && P_INCOMPLETE_SPLIT(opaque))
- {
- BlockNumber blkno = BufferGetBlockNumber(buf);
-
- /* upgrade our lock if necessary */
- if (access == BT_READ)
- {
- _bt_unlockbuf(rel, buf);
- _bt_lockbuf(rel, buf, BT_WRITE);
- }
-
- if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, heaprel, buf, stack);
- else
- _bt_relbuf(rel, buf);
-
- /* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
- continue;
- }
-
- /*
-		 * tupdatabuf is filled with the right separator of the parent node.
-		 * This allows us to do a binary equality check between the parent
-		 * node's right separator (which is < key) and this page's P_HIKEY.
-		 * If they are equal, we can reuse the result of the parent node's
-		 * rightkey compare, which means we can potentially save a full key
-		 * compare (which includes indirect calls to attribute comparison
-		 * functions).
-		 *
-		 * Without this, we'd on average use 3 full key compares per page before
-		 * we achieve full dynamic prefix bounds, but with this optimization
-		 * that is only 2.
-		 *
-		 * 3 compares: 1 for the high key (rightmost), and on average 2 before
-		 * we move right in the binary search on the page; this average equals
-		 * SUM((1/2)^x) for x from 0 to log(n items), which tends to 2.
- */
- if (!P_IGNORE(opaque) && *comparecol > 1)
- {
- IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
- IndexTuple buftuple = (IndexTuple) tupdatabuf;
- if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
- {
- char *dataptr = (char *) itup;
-
- if (memcmp(dataptr + sizeof(IndexTupleData),
- tupdatabuf + sizeof(IndexTupleData),
- IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
- break;
- } else {
- *comparecol = 1;
- }
- } else {
- *comparecol = 1;
- }
-
- if (P_IGNORE(opaque) ||
- _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
- {
- *comparecol = 1;
- /* step right one page */
- buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
- continue;
- }
- else
- {
- *comparecol = cmpcol;
- break;
- }
- }
-
- if (P_IGNORE(opaque))
- elog(ERROR, "fell off the end of index \"%s\"",
- RelationGetRelationName(rel));
-
- return buf;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
- *
- * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
- *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true. (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.) This key indicates the
- * right place to descend to be sure we find all leaf keys >= given scankey
- * (or leaf keys > given scankey when nextkey is true).
- *
- * When called, the "highkeycmpcol" pointer argument is expected to contain the
- * AttrNumber of the first attribute that is not shared between scan key and
- * this page's high key, i.e. the first attribute that we have to compare
- * against the scan key. The value will be updated by _bt_binsrch to contain
- * this same first column we'll need to compare against the scan key, but now
- * for the index tuple at the returned offset. Valid values range from 1
- * (no shared prefix) to the number of key attributes + 1 (all index key
- * attributes are equal to the scan key). See also _bt_compare, and
- * backend/access/nbtree/README for more info.
- *
- * This procedure is not responsible for walking right, it just examines
- * the given page. _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
- */
-static OffsetNumber
-_bt_binsrch(Relation rel,
- BTScanInsert key,
- Buffer buf,
- AttrNumber *highkeycmpcol)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high;
- int32 result,
- cmpval;
- /*
- * Prefix bounds, for the high/low offset's compare columns.
- * "highkeycmpcol" is the value for this page's high key (if any) or 1
- * (no established shared prefix)
- */
- AttrNumber highcmpcol = *highkeycmpcol,
- lowcmpcol = 1;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
-
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
- /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
- Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
-
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
-
- /*
- * If there are no keys on the page, return the first available slot. Note
- * this covers two cases: the page is really empty (no keys), or it
- * contains only a high key. The latter case is possible after vacuuming.
- * This can never happen on an internal page, however, since they are
- * never empty (an internal page must have children).
- */
- if (unlikely(high < low))
- return low;
-
- /*
- * Binary search to find the first key on the page >= scan key, or first
- * key > scankey when nextkey is true.
- *
- * For nextkey=false (cmpval=1), the loop invariant is: all slots before
- * 'low' are < scan key, all slots at or after 'high' are >= scan key.
- *
- * For nextkey=true (cmpval=0), the loop invariant is: all slots before
- * 'low' are <= scan key, all slots at or after 'high' are > scan key.
- *
- * We maintain highcmpcol and lowcmpcol to keep track of prefixes that
- * tuples share with the scan key, potentially allowing us to skip a
- * prefix in the midpoint comparison.
- *
- * We can fall out when high == low.
- */
- high++; /* establish the loop invariant for high */
-
- cmpval = key->nextkey ? 0 : 1; /* select comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol); /* update prefix bounds */
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
- }
- }
-
- /* update the bounds at the caller */
- *highkeycmpcol = highcmpcol;
-
- /*
- * At this point we have high == low, but be careful: they could point
- * past the last slot on the page.
- *
- * On a leaf page, we always return the first key >= scan key (resp. >
- * scan key), which could be the last slot + 1.
- */
- if (P_ISLEAF(opaque))
- return low;
-
- /*
- * On a non-leaf page, return the last key < scan key (resp. <= scan key).
- * There must be one if _bt_compare() is playing by the rules.
- */
- Assert(low > P_FIRSTDATAKEY(opaque));
-
- return OffsetNumberPrev(low);
-}
-
-/*
- *
- * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
- *
- * Like _bt_binsrch(), but with support for caching the binary search
- * bounds. Only used during insertion, and only on the leaf page that it
- * looks like caller will insert tuple on. Exclusive-locked and pinned
- * leaf page is contained within insertstate.
- *
- * Caches the bounds fields in insertstate so that a subsequent call can
- * reuse the low and strict high bounds of original binary search. Callers
- * that use these fields directly must be prepared for the case where low
- * and/or stricthigh are not on the same page (one or both exceed maxoff
- * for the page). The case where there are no items on the page (high <
- * low) makes bounds invalid.
- *
- * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time, and for dealing with posting list
- * tuple matches (callers can use insertstate's postingoff field to
- * determine which existing heap TID will need to be replaced by a posting
- * list split).
- */
-OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol)
-{
- BTScanInsert key = insertstate->itup_key;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber low,
- high,
- stricthigh;
- int32 result,
- cmpval;
- AttrNumber lowcmpcol = 1;
-
- page = BufferGetPage(insertstate->buf);
- opaque = BTPageGetOpaque(page);
-
- Assert(P_ISLEAF(opaque));
- Assert(!key->nextkey);
- Assert(insertstate->postingoff == 0);
-
- if (!insertstate->bounds_valid)
- {
- /* Start new binary search */
- low = P_FIRSTDATAKEY(opaque);
- high = PageGetMaxOffsetNumber(page);
- }
- else
- {
- /* Restore result of previous binary search against same page */
- low = insertstate->low;
- high = insertstate->stricthigh;
- }
-
- /* If there are no keys on the page, return the first available slot */
- if (unlikely(high < low))
- {
- /* Caller can't reuse bounds */
- insertstate->low = InvalidOffsetNumber;
- insertstate->stricthigh = InvalidOffsetNumber;
- insertstate->bounds_valid = false;
- return low;
- }
-
- /*
- * Binary search to find the first key on the page >= scan key. (nextkey
- * is always false when inserting).
- *
- * The loop invariant is: all slots before 'low' are < scan key, all slots
- * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
- * maintained to save additional search effort for caller.
- *
- * We can fall out when high == low.
- */
- if (!insertstate->bounds_valid)
- high++; /* establish the loop invariant for high */
- stricthigh = high; /* high initially strictly higher */
-
- cmpval = 1; /* !nextkey comparison value */
-
- while (high > low)
- {
- OffsetNumber mid = low + ((high - low) / 2);
- AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
-
- /* We have low <= mid < high, so mid points at a real slot */
-
- result = _bt_compare(rel, key, page, mid, &cmpcol);
-
- if (result >= cmpval)
- {
- low = mid + 1;
- lowcmpcol = cmpcol;
- }
- else
- {
- high = mid;
- highcmpcol = cmpcol;
-
- if (result != 0)
- stricthigh = high;
- }
-
- /*
- * If tuple at offset located by binary search is a posting list whose
- * TID range overlaps with caller's scantid, perform posting list
- * binary search to set postingoff for caller. Caller must split the
- * posting list when postingoff is set. This should happen
- * infrequently.
- */
- if (unlikely(result == 0 && key->scantid != NULL))
- {
- /*
- * postingoff should never be set more than once per leaf page
- * binary search. That would mean that there are duplicate table
- * TIDs in the index, which is never okay. Check for that here.
- */
- if (insertstate->postingoff != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
- ItemPointerGetBlockNumber(key->scantid),
- ItemPointerGetOffsetNumber(key->scantid),
- low, stricthigh,
- BufferGetBlockNumber(insertstate->buf),
- RelationGetRelationName(rel))));
-
- insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
- }
- }
-
- /*
- * On a leaf page, a binary search always returns the first key >= scan
- * key (at least in !nextkey case), which could be the last slot + 1. This
- * is also the lower bound of cached search.
- *
- * stricthigh may also be the last slot + 1, which prevents caller from
- * using bounds directly, but is still useful to us if we're called a
- * second time with cached bounds (cached low will be < stricthigh when
- * that happens).
- */
- insertstate->low = low;
- insertstate->stricthigh = stricthigh;
- insertstate->bounds_valid = true;
-
- return low;
-}
-
/*----------
* _bt_binsrch_posting() -- posting list binary search.
*
@@ -724,235 +137,6 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
return low;
}
-/*----------
- * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
- *
- * page/offnum: location of btree item to be compared to.
- *
- * This routine returns:
- * <0 if scankey < tuple at offnum;
- * 0 if scankey == tuple at offnum;
- * >0 if scankey > tuple at offnum.
- *
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be returned
- * to the caller as a matching key. Similarly, an insertion scankey
- * with its scantid set is treated as equal to a posting tuple whose TID
- * range overlaps with their scantid. There generally won't be a
- * matching TID in the posting tuple, which caller must handle
- * themselves (e.g., by splitting the posting list tuple).
- *
- * NOTE: The "comparecol" argument must refer to the first attribute of the
- * index tuple of which the caller knows that it does not match the scan key:
- * this means 1 for "no known matching attributes", up to the number of key
- * attributes + 1 if the caller knows that all key attributes of the index
- * tuple match those of the scan key. See backend/access/nbtree/README for
- * details.
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey. The actual key value stored is explicitly truncated to 0
- * attributes (explicitly minus infinity) with version 3+ indexes, but
- * that isn't relied upon. This allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first
- * key. See backend/access/nbtree/README for details.
- *----------
- */
-int32
-_bt_compare(Relation rel,
- BTScanInsert key,
- Page page,
- OffsetNumber offnum,
- AttrNumber *comparecol)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- BTPageOpaque opaque = BTPageGetOpaque(page);
- IndexTuple itup;
- ItemPointer heapTid;
- ScanKey scankey;
- int ncmpkey;
- int ntupatts;
- int32 result;
-
- Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
- Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
- Assert(key->heapkeyspace || key->scantid == NULL);
-
- /*
- * Force result ">" if target item is first data item on an internal page
- * --- see NOTE above.
- */
- if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
- return 1;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- ntupatts = BTreeTupleGetNAtts(itup, rel);
-
- /*
- * The scan key is set up with the attribute number associated with each
- * term in the key. It is important that, if the index is multi-key, the
- * scan contain the first k key attributes, and that they be in order. If
- * you think about how multi-key ordering works, you'll understand why
- * this is.
- *
- * We don't test for violation of this condition here, however. The
- * initial setup for the index scan had better have gotten it right (see
- * _bt_first).
- */
-
- ncmpkey = Min(ntupatts, key->keysz);
- Assert(key->heapkeyspace || ncmpkey == key->keysz);
- Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
-
- scankey = key->scankeys + ((*comparecol) - 1);
- for (int i = *comparecol; i <= ncmpkey; i++)
- {
- Datum datum;
- bool isNull;
-
- datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
-
- if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- result = 0; /* NULL "=" NULL */
- else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = -1; /* NULL "<" NOT_NULL */
- else
- result = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- if (scankey->sk_flags & SK_BT_NULLS_FIRST)
- result = 1; /* NOT_NULL ">" NULL */
- else
- result = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- /*
- * The sk_func needs to be passed the index value as left arg and
- * the sk_argument as right arg (they might be of different
- * types). Since it is convenient for callers to think of
- * _bt_compare as comparing the scankey to the index item, we have
- * to flip the sign of the comparison result. (Unless it's a DESC
- * column, in which case we *don't* flip the sign.)
- */
- result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum,
- scankey->sk_argument));
-
- if (!(scankey->sk_flags & SK_BT_DESC))
- INVERT_COMPARE_RESULT(result);
- }
-
- /* if the keys are unequal, return the difference */
- if (result != 0)
- {
- *comparecol = i;
- return result;
- }
-
- scankey++;
- }
-
- /*
- * All tuple attributes are equal to the scan key, only later attributes
- * could potentially not equal the scan key.
- */
- *comparecol = ntupatts + 1;
-
- /*
- * All non-truncated attributes (other than heap TID) were found to be
- * equal. Treat truncated attributes as minus infinity when scankey has a
- * key attribute value that would otherwise be compared directly.
- *
- * Note: it doesn't matter if ntupatts includes non-key attributes;
- * scankey won't, so explicitly excluding non-key attributes isn't
- * necessary.
- */
- if (key->keysz > ntupatts)
- return 1;
-
- /*
- * Use the heap TID attribute and scantid to try to break the tie. The
- * rules are the same as any other key attribute -- only the
- * representation differs.
- */
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
- {
- /*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
- return 1;
-
- /* All provided scankey arguments found to be equal */
- return 0;
- }
-
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- /*
- * Scankey must be treated as equal to a posting list tuple if its scantid
- * value falls within the range of the posting list. In all other cases
- * there can only be a single heap TID value, which is compared directly
- * with scantid.
- */
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- result = ItemPointerCompare(key->scantid, heapTid);
- if (result <= 0 || !BTreeTupleIsPosting(itup))
- return result;
- else
- {
- result = ItemPointerCompare(key->scantid,
- BTreeTupleGetMaxHeapTID(itup));
- if (result > 0)
- return 1;
- }
-
- return 0;
-}
-
/*
* _bt_first() -- Find the first item in a scan.
*
@@ -994,6 +178,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
BTScanPosItem *currItem;
BlockNumber blkno;
AttrNumber cmpcol = 1;
+ nbts_prep_ctx(rel);
Assert(!BTScanPosIsValid(so->currPos));
@@ -1627,280 +812,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
-/*
- * _bt_readpage() -- Load data from current index page into so->currPos
- *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate. All other fields of so->currPos are
- * initialized from scratch here.
- *
- * We scan the current page starting at offnum and moving in the indicated
- * direction. All items matching the scan keys are loaded into currPos.items.
- * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
- * that there can be no more matching tuples in the current scan direction.
- *
- * In the case of a parallel scan, caller must have called _bt_parallel_seize
- * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
- *
- * Returns true if any matching items found on the page, false if none.
- */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Page page;
- BTPageOpaque opaque;
- OffsetNumber minoff;
- OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
-
- /*
- * We must have the buffer pinned and locked, but the usual macro can't be
- * used here; this function is what makes it good for currPos.
- */
- Assert(BufferIsValid(so->currPos.buf));
-
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
-
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- _bt_parallel_release(scan, opaque->btpo_next);
- else
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
- minoff = P_FIRSTDATAKEY(opaque);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
-
- /*
- * Now that the current page has been made consistent, the macro should be
- * good.
- */
- Assert(BTScanPosIsPinned(so->currPos));
-
- if (ScanDirectionIsForward(dir))
- {
- /* load items[] in ascending order */
- itemIndex = 0;
-
- offnum = Max(offnum, minoff);
-
- while (offnum <= maxoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- offnum = OffsetNumberNext(offnum);
- continue;
- }
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID
- */
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- itemIndex++;
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- itemIndex++;
- }
- }
- }
- /* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
- break;
-
- offnum = OffsetNumberNext(offnum);
- }
-
- /*
- * We don't need to visit page to the right when the high key
- * indicates that no more matches will be found there.
- *
- * Checking the high key like this works out more often than you might
- * think. Leaf page splits pick a split point between the two most
- * dissimilar tuples (this is weighed against the need to evenly share
- * free space). Leaf pages with high key attribute values that can
- * only appear on non-pivot tuples on the right sibling page are
- * common.
- */
- if (continuescan && !P_RIGHTMOST(opaque))
- {
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
-
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
- }
-
- if (!continuescan)
- so->currPos.moreRight = false;
-
- Assert(itemIndex <= MaxTIDsPerBTreePage);
- so->currPos.firstItem = 0;
- so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
- }
- else
- {
- /* load items[] in descending order */
- itemIndex = MaxTIDsPerBTreePage;
-
- offnum = Min(offnum, maxoff);
-
- while (offnum >= minoff)
- {
- ItemId iid = PageGetItemId(page, offnum);
- IndexTuple itup;
- bool tuple_alive;
- bool passes_quals;
-
- /*
- * If the scan specifies not to return killed tuples, then we
- * treat a killed tuple as not passing the qual. Most of the
- * time, it's a win to not bother examining the tuple's index
- * keys, but just skip to the next tuple (previous, actually,
- * since we're scanning backwards). However, if this is the first
- * tuple on the page, we do check the index keys, to prevent
- * uselessly advancing to the page to the left. This is similar
- * to the high key optimization used by forward scans.
- */
- if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
- {
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
- {
- offnum = OffsetNumberPrev(offnum);
- continue;
- }
-
- tuple_alive = false;
- }
- else
- tuple_alive = true;
-
- itup = (IndexTuple) PageGetItem(page, iid);
-
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
- if (passes_quals && tuple_alive)
- {
- /* tuple passes all scan key conditions */
- if (!BTreeTupleIsPosting(itup))
- {
- /* Remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
- }
- else
- {
- int tupleOffset;
-
- /*
- * Set up state to return posting list, and remember first
- * TID.
- *
- * Note that we deliberately save/return items from
- * posting lists in ascending heap TID order for backwards
- * scans. This allows _bt_killitems() to make a
- * consistent assumption about the order of items
- * associated with the same posting list tuple.
- */
- itemIndex--;
- tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, 0),
- itup);
- /* Remember additional TIDs */
- for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
- {
- itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
- BTreeTupleGetPostingN(itup, i),
- tupleOffset);
- }
- }
- }
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
- break;
- }
-
- offnum = OffsetNumberPrev(offnum);
- }
-
- Assert(itemIndex >= 0);
- so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
- so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
- }
-
- return (so->currPos.firstItem <= so->currPos.lastItem);
-}
-
/* Save an index item into so->currPos.items[itemIndex] */
static void
_bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -2109,12 +1020,11 @@ static bool
_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Relation rel;
+ Relation rel = scan->indexRelation;
Page page;
BTPageOpaque opaque;
bool status;
-
- rel = scan->indexRelation;
+ nbts_prep_ctx(rel);
if (ScanDirectionIsForward(dir))
{
@@ -2516,6 +1426,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
BTPageOpaque opaque;
OffsetNumber start;
BTScanPosItem *currItem;
+ nbts_prep_ctx(rel);
/*
* Scan down to the leftmost or rightmost leaf page. This is a simplified
diff --git a/src/backend/access/nbtree/nbtsearch_spec.c b/src/backend/access/nbtree/nbtsearch_spec.c
new file mode 100644
index 0000000000..383010dc31
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsearch_spec.c
@@ -0,0 +1,1113 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsearch_spec.c
+ * Index shape-specialized functions for nbtsearch.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsearch_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_binsrch NBTS_FUNCTION(_bt_binsrch)
+#define _bt_readpage NBTS_FUNCTION(_bt_readpage)
+
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf,
+ AttrNumber *highkeycmpcol);
+static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+
+/*
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ * Return value is a stack of parent-page pointers (i.e. there is no entry for
+ * the leaf level/page). *bufP is set to the address of the leaf-page buffer,
+ * which is locked and pinned. No locks are held on the parent pages,
+ * however!
+ *
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
+ *
+ * heaprel must be provided by callers that pass access = BT_WRITE, since we
+ * might need to allocate a new root page for caller -- see _bt_allocbuf.
+ */
+BTStack
+_bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
+ int access)
+{
+ BTStack stack_in = NULL;
+ int page_access = BT_READ;
+ char tupdatabuf[BLCKSZ / 3];
+ AttrNumber highkeycmpcol = 1;
+
+ /* heaprel must be set whenever _bt_allocbuf is reachable */
+ Assert(access == BT_READ || access == BT_WRITE);
+ Assert(access == BT_READ || heaprel != NULL);
+
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, heaprel, access);
+
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (!BufferIsValid(*bufP))
+ return (BTStack) NULL;
+
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ BlockNumber child;
+ BTStack new_stack;
+
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * downlink in its parent page (or the metapage). If it has, we may
+ * need to move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way. Strictly speaking, we'd only need to finish an
+ * incomplete split on the leaf page we're about to insert to, not on
+ * any of the upper levels (internal pages with incomplete splits are
+ * also taken care of in _bt_getstackbuf). But this is a good
+ * opportunity to finish splits of internal pages too.
+ */
+ *bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
+ stack_in, page_access, &highkeycmpcol,
+ (char *) tupdatabuf);
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = BTPageGetOpaque(page);
+ if (P_ISLEAF(opaque))
+ break;
+
+ /*
+ * Find the appropriate pivot tuple on this page. Its downlink points
+ * to the child page that we're about to descend to.
+ */
+ offnum = _bt_binsrch(rel, key, *bufP, &highkeycmpcol);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
+ child = BTreeTupleGetDownLink(itup);
+
+ Assert(IndexTupleSize(itup) < sizeof(tupdatabuf));
+ memcpy((char *) tupdatabuf, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We need to save the location of the pivot tuple we chose in a new
+ * stack entry for this page/level. If caller ends up splitting a
+ * page one level down, it usually ends up inserting a new pivot
+ * tuple/downlink immediately after the location recorded here.
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = BufferGetBlockNumber(*bufP);
+ new_stack->bts_offset = offnum;
+ new_stack->bts_parent = stack_in;
+
+ /*
+ * Page level 1 is lowest non-leaf page level prior to leaves. So, if
+ * we're on the level 1 and asked to lock leaf page in write mode,
+ * then lock next page in write mode, because it must be a leaf.
+ */
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
+ page_access = BT_WRITE;
+
+ /* drop the read lock on the page, then acquire one on its child */
+ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
+
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
+
+ /*
+ * If we're asked to lock leaf in write mode, but didn't manage to, then
+ * relock. This should only happen when the root page is a leaf page (and
+ * the only page in the index other than the metapage).
+ */
+ if (access == BT_WRITE && page_access == BT_READ)
+ {
+ highkeycmpcol = 1;
+
+ /* trade in our read lock for a write lock */
+ _bt_unlockbuf(rel, *bufP);
+ _bt_lockbuf(rel, *bufP, BT_WRITE);
+
+ /*
+ * Race -- the leaf page may have split after we dropped the read lock
+ * but before we acquired a write lock. If it has, we may need to
+ * move right to its new sibling. Do that.
+ */
+ *bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+ &highkeycmpcol, (char *) tupdatabuf);
+ }
+
+ return stack_in;
+}
+
+/*
+ * _bt_moveright() -- move right in the btree if necessary.
+ *
+ * When we follow a pointer to reach a page, it is possible that
+ * the page has changed in the meanwhile. If this happens, we're
+ * guaranteed that the page has "split right" -- that is, that any
+ * data that appeared on the page originally is either on the page
+ * or strictly to the right of it.
+ *
+ * This routine decides whether or not we need to move right in the
+ * tree by examining the high key entry on the page. If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
+ *
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
+ *
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key. When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
+ *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when locking a target page for an
+ * insertion, because we don't allow inserting on a page before the split is
+ * completed. 'heaprel' and 'stack' are only used if forupdate is true.
+ *
+ * On entry, we have the buffer pinned and a lock of the type specified by
+ * 'access'. If we move right, we release the buffer and lock and acquire
+ * the same on the right sibling. Return value is the buffer we stop at.
+ */
+Buffer
+_bt_moveright(Relation rel,
+ Relation heaprel,
+ BTScanInsert key,
+ Buffer buf,
+ bool forupdate,
+ BTStack stack,
+ int access,
+ AttrNumber *comparecol,
+ char *tupdatabuf)
+{
+ Page page;
+ BTPageOpaque opaque;
+ int32 cmpval;
+
+ Assert(!forupdate || heaprel != NULL);
+ Assert(PointerIsValid(comparecol) && PointerIsValid(tupdatabuf));
+
+ /*
+ * When nextkey = false (normal case): if the scan key that brought us to
+ * this page is > the high key stored on the page, then the page has split
+ * and we need to move right. (pg_upgrade'd !heapkeyspace indexes could
+ * have some duplicates to the right as well as the left, but that's
+ * something that's only ever dealt with on the leaf level, after
+ * _bt_search has found an initial leaf page.)
+ *
+ * When nextkey = true: move right if the scan key is >= page's high key.
+ * (Note that key.scantid cannot be set in this case.)
+ *
+ * The page could even have split more than once, so scan as far as
+ * needed.
+ *
+ * We also have to move right if we followed a link that brought us to a
+ * dead page.
+ */
+ cmpval = key->nextkey ? 0 : 1;
+
+ for (;;)
+ {
+ AttrNumber cmpcol = 1;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ if (P_RIGHTMOST(opaque))
+ {
+ *comparecol = 1;
+ break;
+ }
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ /* upgrade our lock if necessary */
+ if (access == BT_READ)
+ {
+ _bt_unlockbuf(rel, buf);
+ _bt_lockbuf(rel, buf, BT_WRITE);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ _bt_finish_split(rel, heaprel, buf, stack);
+ else
+ _bt_relbuf(rel, buf);
+
+ /* re-acquire the lock in the right mode, and re-check */
+ buf = _bt_getbuf(rel, blkno, access);
+ continue;
+ }
+
+ /*
+		 * tupdatabuf is filled with the right separator of the parent node.
+		 * This allows us to do a binary equality check between the parent
+		 * node's right separator (which is < key) and this page's P_HIKEY.
+		 * If they are equal, we can reuse the result of the parent node's
+		 * rightkey compare, which means we can potentially save a full key
+		 * compare (which includes indirect calls to attribute comparison
+		 * functions).
+		 *
+		 * Without this, we'd on average use 3 full key compares per page before
+		 * we achieve full dynamic prefix bounds, but with this optimization
+		 * that is only 2.
+		 *
+		 * 3 compares: 1 for the high key (rightmost), and on average 2 before
+		 * we move right in the binary search on the page; this average equals
+		 * SUM((1/2)^x) for x from 0 to log(n items), which tends to 2.
+ */
+ if (!P_IGNORE(opaque) && *comparecol > 1)
+ {
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ IndexTuple buftuple = (IndexTuple) tupdatabuf;
+ if (IndexTupleSize(itup) == IndexTupleSize(buftuple))
+ {
+ char *dataptr = (char *) itup;
+
+ if (memcmp(dataptr + sizeof(IndexTupleData),
+ tupdatabuf + sizeof(IndexTupleData),
+ IndexTupleSize(itup) - sizeof(IndexTupleData)) == 0)
+ break;
+ } else {
+ *comparecol = 1;
+ }
+ } else {
+ *comparecol = 1;
+ }
+
+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
+ {
+ *comparecol = 1;
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ {
+ *comparecol = cmpcol;
+ break;
+ }
+ }
+
+ if (P_IGNORE(opaque))
+ elog(ERROR, "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+
+ return buf;
+}
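
Spelling out the arithmetic behind the comment in the loop above: on each internal page we pay one full _bt_compare against the high key, plus however many midpoint compares the binary search needs before it first takes the "move right" branch and thereby establishes the low-side prefix bound. The chance that the first x midpoint compares all go left is roughly (1/2)^x, so that expected count is SUM((1/2)^x) for x from 0 up to log2(n items), i.e. just under 2; together with the high-key compare that is about 3 full compares per page. When the memcmp() check finds the high key byte-equal to the parent's downlink separator, the high-key compare is skipped and the parent's prefix bound is reused, leaving about 2.
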
+
+/*
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ *
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey, or > scankey if nextkey is true. (NOTE: in
+ * particular, this means it is possible to return a value 1 greater than the
+ * number of keys on the page, if the scankey is > all keys on the page.)
+ *
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey, or last key <= given scankey if nextkey
+ * is true. (Since _bt_compare treats the first data key of such a page as
+ * minus infinity, there will be at least one key < scankey, so the result
+ * always points at one of the keys on the page.) This key indicates the
+ * right place to descend to be sure we find all leaf keys >= given scankey
+ * (or leaf keys > given scankey when nextkey is true).
+ *
+ * When called, the "highkeycmpcol" pointer argument is expected to contain the
+ * AttrNumber of the first attribute that is not shared between scan key and
+ * this page's high key, i.e. the first attribute that we have to compare
+ * against the scan key. The value will be updated by _bt_binsrch to contain
+ * this same first column we'll need to compare against the scan key, but now
+ * for the index tuple at the returned offset. Valid values range from 1
+ * (no shared prefix) to the number of key attributes + 1 (all index key
+ * attributes are equal to the scan key). See also _bt_compare, and
+ * backend/access/nbtree/README for more info.
+ *
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
+ */
+static OffsetNumber
+_bt_binsrch(Relation rel,
+ BTScanInsert key,
+ Buffer buf,
+ AttrNumber *highkeycmpcol)
+{
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int32 result,
+ cmpval;
+ /*
+ * Prefix bounds, for the high/low offset's compare columns.
+ * "highkeycmpcol" is the value for this page's high key (if any) or 1
+ * (no established shared prefix)
+ */
+ AttrNumber highcmpcol = *highkeycmpcol,
+ lowcmpcol = 1;
+
+ page = BufferGetPage(buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /*
+ * If there are no keys on the page, return the first available slot. Note
+ * this covers two cases: the page is really empty (no keys), or it
+ * contains only a high key. The latter case is possible after vacuuming.
+ * This can never happen on an internal page, however, since they are
+ * never empty (an internal page must have children).
+ */
+ if (unlikely(high < low))
+ return low;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We maintain highcmpcol and lowcmpcol to keep track of prefixes that
+ * tuples share with the scan key, potentially allowing us to skip a
+ * prefix in the midpoint comparison.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol); /* update prefix bounds */
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+ }
+ }
+
+ /* update the bounds at the caller */
+ *highkeycmpcol = highcmpcol;
+
+ /*
+ * At this point we have high == low, but be careful: they could point
+ * past the last slot on the page.
+ *
+ * On a leaf page, we always return the first key >= scan key (resp. >
+ * scan key), which could be the last slot + 1.
+ */
+ if (P_ISLEAF(opaque))
+ return low;
+
+ /*
+ * On a non-leaf page, return the last key < scan key (resp. <= scan key).
+ * There must be one if _bt_compare() is playing by the rules.
+ */
+ Assert(low > P_FIRSTDATAKEY(opaque));
+
+ return OffsetNumberPrev(low);
+}
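
For reviewers who want to convince themselves of the prefix-bound bookkeeping in isolation: below is a minimal stand-alone sketch, over toy fixed-width integer "tuples", of the same low/high bound tracking that _bt_binsrch() does with lowcmpcol/highcmpcol and that _bt_compare() consumes through its comparecol argument (here 0-based, so pfx == cmpcol - 1). Skipping min(low_pfx, high_pfx) leading columns is safe because any midpoint tuple sorts between the two tuples that established those bounds, and both of those agree with the search key on that shared prefix, so the midpoint tuple must as well. Names and layout are invented for the illustration; nothing here is the patch's code.

#include <stdio.h>

#define NCOLS 3

/* compare key and tuple, starting at column "start"; report the new prefix */
static int
cmp_from(const int *key, const int *tup, int start, int *pfx)
{
	for (int i = start; i < NCOLS; i++)
	{
		if (key[i] != tup[i])
		{
			*pfx = i;			/* first differing column, 0-based */
			return key[i] < tup[i] ? -1 : 1;
		}
	}
	*pfx = NCOLS;				/* all columns equal */
	return 0;
}

/* find the first tuple >= key, maintaining dynamic prefix bounds */
static int
binsrch_prefix(int tups[][NCOLS], int ntups, const int *key)
{
	int			low = 0;
	int			high = ntups;
	int			low_pfx = 0;	/* no bound established yet: skip nothing */
	int			high_pfx = 0;

	while (high > low)
	{
		int			mid = low + (high - low) / 2;
		int			start = low_pfx < high_pfx ? low_pfx : high_pfx;
		int			pfx;
		int			r = cmp_from(key, tups[mid], start, &pfx);

		if (r > 0)
		{
			low = mid + 1;		/* tuple < key: becomes the low bound */
			low_pfx = pfx;
		}
		else
		{
			high = mid;			/* tuple >= key: becomes the high bound */
			high_pfx = pfx;
		}
	}
	return low;
}

int
main(void)
{
	int			tups[][NCOLS] = {{1, 1, 1}, {1, 1, 5}, {1, 2, 0}, {2, 0, 0}};
	int			key[NCOLS] = {1, 2, 0};

	printf("%d\n", binsrch_prefix(tups, 4, key));	/* prints 2 */
	return 0;
}
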
+
+/*
+ *
+ * _bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds. Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on. Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search. Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page). The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
+ */
+OffsetNumber
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol)
+{
+ BTScanInsert key = insertstate->itup_key;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high,
+ stricthigh;
+ int32 result,
+ cmpval;
+ AttrNumber lowcmpcol = 1;
+
+ page = BufferGetPage(insertstate->buf);
+ opaque = BTPageGetOpaque(page);
+
+ Assert(P_ISLEAF(opaque));
+ Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
+
+ if (!insertstate->bounds_valid)
+ {
+ /* Start new binary search */
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+ }
+ else
+ {
+ /* Restore result of previous binary search against same page */
+ low = insertstate->low;
+ high = insertstate->stricthigh;
+ }
+
+ /* If there are no keys on the page, return the first available slot */
+ if (unlikely(high < low))
+ {
+ /* Caller can't reuse bounds */
+ insertstate->low = InvalidOffsetNumber;
+ insertstate->stricthigh = InvalidOffsetNumber;
+ insertstate->bounds_valid = false;
+ return low;
+ }
+
+ /*
+ * Binary search to find the first key on the page >= scan key. (nextkey
+ * is always false when inserting).
+ *
+ * The loop invariant is: all slots before 'low' are < scan key, all slots
+ * at or after 'high' are >= scan key. 'stricthigh' is > scan key, and is
+ * maintained to save additional search effort for caller.
+ *
+ * We can fall out when high == low.
+ */
+ if (!insertstate->bounds_valid)
+ high++; /* establish the loop invariant for high */
+ stricthigh = high; /* high initially strictly higher */
+
+ cmpval = 1; /* !nextkey comparison value */
+
+ while (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+ AttrNumber cmpcol = Min(highcmpcol, lowcmpcol);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ result = _bt_compare(rel, key, page, mid, &cmpcol);
+
+ if (result >= cmpval)
+ {
+ low = mid + 1;
+ lowcmpcol = cmpcol;
+ }
+ else
+ {
+ high = mid;
+ highcmpcol = cmpcol;
+
+ if (result != 0)
+ stricthigh = high;
+ }
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ {
+ /*
+ * postingoff should never be set more than once per leaf page
+ * binary search. That would mean that there are duplicate table
+ * TIDs in the index, which is never okay. Check for that here.
+ */
+ if (insertstate->postingoff != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
+ ItemPointerGetBlockNumber(key->scantid),
+ ItemPointerGetOffsetNumber(key->scantid),
+ low, stricthigh,
+ BufferGetBlockNumber(insertstate->buf),
+ RelationGetRelationName(rel))));
+
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
+ }
+ }
+
+ /*
+ * On a leaf page, a binary search always returns the first key >= scan
+ * key (at least in !nextkey case), which could be the last slot + 1. This
+ * is also the lower bound of cached search.
+ *
+ * stricthigh may also be the last slot + 1, which prevents caller from
+ * using bounds directly, but is still useful to us if we're called a
+ * second time with cached bounds (cached low will be < stricthigh when
+ * that happens).
+ */
+ insertstate->low = low;
+ insertstate->stricthigh = stricthigh;
+ insertstate->bounds_valid = true;
+
+ return low;
+}
+
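The bounds caching described in the comment above has a simple analogue outside nbtree. In the sketch below (illustrative names only, not patch code) a first search over a sorted array records the [low, stricthigh) window it ended with, and a repeat search for the same key restarts inside that window and therefore needs fewer comparisons:

/*
 * cached_binsrch.c -- standalone sketch of the cached-bounds idea behind
 * _bt_binsrch_insert().  Illustration only; not patch code.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct SearchBounds
{
    bool    valid;
    int     low;            /* first slot >= key */
    int     stricthigh;     /* first slot > key */
} SearchBounds;

static int  ncomparisons = 0;

static int
cmp_counted(int key, int val)
{
    ncomparisons++;
    return (key > val) - (key < val);
}

/* Find the first slot >= key, maintaining cached bounds like insertstate */
static int
binsrch_cached(const int *arr, int nelems, int key, SearchBounds *bounds)
{
    int     low = 0,
            high = nelems,
            stricthigh;

    if (bounds->valid)
    {
        /* restore the previous search's window over the same array */
        low = bounds->low;
        high = bounds->stricthigh;
    }
    stricthigh = high;

    while (high > low)
    {
        int     mid = low + (high - low) / 2;
        int     result = cmp_counted(key, arr[mid]);

        if (result > 0)
            low = mid + 1;
        else
        {
            high = mid;
            if (result != 0)
                stricthigh = high;
        }
    }

    bounds->valid = true;
    bounds->low = low;
    bounds->stricthigh = stricthigh;
    return low;
}

int
main(void)
{
    int     arr[] = {1, 3, 3, 3, 5, 8, 13, 21};
    SearchBounds bounds = {false, 0, 0};

    binsrch_cached(arr, 8, 3, &bounds);
    printf("first search:  %d comparisons\n", ncomparisons);

    ncomparisons = 0;
    binsrch_cached(arr, 8, 3, &bounds);     /* reuses [low, stricthigh) */
    printf("second search: %d comparisons\n", ncomparisons);
    return 0;
}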
+/*----------
+ * _bt_compare() -- Compare insertion-type scankey to tuple on a page.
+ *
+ * page/offnum: location of btree item to be compared to.
+ *
+ * This routine returns:
+ * <0 if scankey < tuple at offnum;
+ * 0 if scankey == tuple at offnum;
+ * >0 if scankey > tuple at offnum.
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * NOTE: The "comparecol" argument must refer to the first key attribute of
+ * the index tuple that is not already known to be equal to the scan key:
+ * pass 1 when no attributes are known to match, up to the number of key
+ * attributes + 1 when all key attributes of the index tuple are known to
+ * match those of the scan key. On return it holds the attribute at which
+ * the comparison stopped (the first attribute found unequal, or the tuple's
+ * number of attributes + 1 when all compared attributes were equal). See
+ * backend/access/nbtree/README for details.
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon. This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key. See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+_bt_compare(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ AttrNumber *comparecol)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+ IndexTuple itup;
+ ItemPointer heapTid;
+ ScanKey scankey;
+ int ncmpkey;
+ int ntupatts;
+ int32 result;
+
+ Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+ Assert(key->heapkeyspace || key->scantid == NULL);
+
+ /*
+ * Force result ">" if target item is first data item on an internal page
+ * --- see NOTE above.
+ */
+ if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+ return 1;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+ /*
+ * The scan key is set up with the attribute number associated with each
+ * term in the key. It is important that, if the index is multi-key, the
+ * scan contain the first k key attributes, and that they be in order. If
+ * you think about how multi-key ordering works, you'll understand why
+ * this is.
+ *
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right (see
+ * _bt_first).
+ */
+
+ ncmpkey = Min(ntupatts, key->keysz);
+ Assert(key->heapkeyspace || ncmpkey == key->keysz);
+ Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+
+ scankey = key->scankeys + ((*comparecol) - 1);
+ for (int i = *comparecol; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ {
+ *comparecol = i;
+ return result;
+ }
+
+ scankey++;
+ }
+
+ /*
+ * All of the tuple's compared attributes are equal to the scan key; only
+ * attributes beyond those (truncated attributes or the heap TID) could
+ * still differ.
+ */
+ *comparecol = ntupatts + 1;
+
+ /*
+ * All non-truncated attributes (other than heap TID) were found to be
+ * equal. Treat truncated attributes as minus infinity when scankey has a
+ * key attribute value that would otherwise be compared directly.
+ *
+ * Note: it doesn't matter if ntupatts includes non-key attributes;
+ * scankey won't, so explicitly excluding non-key attributes isn't
+ * necessary.
+ */
+ if (key->keysz > ntupatts)
+ return 1;
+
+ /*
+ * Use the heap TID attribute and scantid to try to break the tie. The
+ * rules are the same as any other key attribute -- only the
+ * representation differs.
+ */
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
+ /*
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
+ return 1;
+
+ /*
+ * Scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * with scantid.
+ */
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
+}
+
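The NULL handling in the comparison loop of _bt_compare() amounts to a three-way comparator in which NULLs sort entirely before or after all non-NULL values depending on NULLS FIRST/LAST, and in which only the non-NULL comparison result is inverted for DESC columns (the NULLS flags already encode the effective direction). A minimal standalone version of those rules, with made-up names, looks like this:

/*
 * null_cmp.c -- standalone illustration of the NULL-ordering rules in the
 * comparison loop above.  Illustration only; not patch code.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct NullableInt
{
    bool    isnull;
    int     value;
} NullableInt;

/*
 * Three-way comparison of a scan-key value against an index value: returns
 * <0, 0 or >0 when the key sorts before, equal to, or after the index value
 * under the column's ordering options.
 */
static int
nullable_cmp(NullableInt key, NullableInt idx, bool nulls_first, bool desc)
{
    int     result;

    if (key.isnull)
    {
        if (idx.isnull)
            return 0;                       /* NULL "=" NULL */
        return nulls_first ? -1 : 1;        /* NULL vs NOT NULL */
    }
    if (idx.isnull)
        return nulls_first ? 1 : -1;        /* NOT NULL vs NULL */

    /* Both non-NULL: ordinary comparison, inverted only for DESC columns */
    result = (key.value > idx.value) - (key.value < idx.value);
    return desc ? -result : result;
}

int
main(void)
{
    NullableInt null_val = {true, 0};
    NullableInt ten = {false, 10};
    NullableInt twenty = {false, 20};

    /* ASC NULLS LAST: NULL sorts after every non-NULL value */
    printf("%d\n", nullable_cmp(null_val, twenty, false, false));
    /* DESC: larger values sort first, so 10 sorts after 20 */
    printf("%d\n", nullable_cmp(ten, twenty, false, true));
    return 0;
}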
+/*
+ * _bt_readpage() -- Load data from current index page into so->currPos
+ *
+ * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
+ * is not changed here. Also, currPos.moreLeft and moreRight must be valid;
+ * they are updated as appropriate. All other fields of so->currPos are
+ * initialized from scratch here.
+ *
+ * We scan the current page starting at offnum and moving in the indicated
+ * direction. All items matching the scan keys are loaded into currPos.items.
+ * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
+ * that there can be no more matching tuples in the current scan direction.
+ *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
+ * Returns true if any matching items found on the page, false if none.
+ */
+static bool
+_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int itemIndex;
+ bool continuescan;
+ int indnatts;
+
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+
+ /* allow next page to be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, opaque->btpo_next);
+ else
+ _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ }
+
+ continuescan = true; /* default assumption */
+ indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it is
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ offnum = Max(offnum, minoff);
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ offnum = OffsetNumberNext(offnum);
+ continue;
+ }
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID
+ */
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ itemIndex++;
+ }
+ }
+ }
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!continuescan)
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * We don't need to visit page to the right when the high key
+ * indicates that no more matches will be found there.
+ *
+ * Checking the high key like this works out more often than you might
+ * think. Leaf page splits pick a split point between the two most
+ * dissimilar tuples (this is weighed against the need to evenly share
+ * free space). Leaf pages with high key attribute values that can
+ * only appear on non-pivot tuples on the right sibling page are
+ * common.
+ */
+ if (continuescan && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+ int truncatt;
+
+ truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ }
+
+ if (!continuescan)
+ so->currPos.moreRight = false;
+
+ Assert(itemIndex <= MaxTIDsPerBTreePage);
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxTIDsPerBTreePage;
+
+ offnum = Min(offnum, maxoff);
+
+ while (offnum >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple itup;
+ bool tuple_alive;
+ bool passes_quals;
+
+ /*
+ * If the scan specifies not to return killed tuples, then we
+ * treat a killed tuple as not passing the qual. Most of the
+ * time, it's a win to not bother examining the tuple's index
+ * keys, but just skip to the next tuple (previous, actually,
+ * since we're scanning backwards). However, if this is the first
+ * tuple on the page, we do check the index keys, to prevent
+ * uselessly advancing to the page to the left. This is similar
+ * to the high key optimization used by forward scans.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ {
+ Assert(offnum >= P_FIRSTDATAKEY(opaque));
+ if (offnum > P_FIRSTDATAKEY(opaque))
+ {
+ offnum = OffsetNumberPrev(offnum);
+ continue;
+ }
+
+ tuple_alive = false;
+ }
+ else
+ tuple_alive = true;
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+ &continuescan);
+ if (passes_quals && tuple_alive)
+ {
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int tupleOffset;
+
+ /*
+ * Set up state to return posting list, and remember first
+ * TID.
+ *
+ * Note that we deliberately save/return items from
+ * posting lists in ascending heap TID order for backwards
+ * scans. This allows _bt_killitems() to make a
+ * consistent assumption about the order of items
+ * associated with the same posting list tuple.
+ */
+ itemIndex--;
+ tupleOffset =
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ /* Remember additional TIDs */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ tupleOffset);
+ }
+ }
+ }
+ if (!continuescan)
+ {
+ /* there can't be any more matches, so stop */
+ so->currPos.moreLeft = false;
+ break;
+ }
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+ so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c2665fce41..8742716383 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -279,8 +279,6 @@ static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
BTPageState *state,
BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
-static void _bt_load(BTWriteState *wstate,
- BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
int request);
static void _bt_end_parallel(BTLeader *btleader);
@@ -293,6 +291,8 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
Sharedsort *sharedsort2, int sortmem,
bool progress);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtsort_spec.c"
+#include "access/nbtree_spec.h"
/*
* btbuild() -- build a new btree index.
@@ -544,6 +544,7 @@ static void
_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
{
BTWriteState wstate;
+ nbts_prep_ctx(btspool->index);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
@@ -846,6 +847,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
Size pgspc;
Size itupsz;
bool isleaf;
+ nbts_prep_ctx(wstate->index);
/*
* This is a handy place to check for cancel interrupts during the btree
@@ -1178,264 +1180,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
-/*
- * Read tuples in correct sort order from tuplesort, and load them into
- * btree leaves.
- */
-static void
-_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
-{
- BTPageState *state = NULL;
- bool merge = (btspool2 != NULL);
- IndexTuple itup,
- itup2 = NULL;
- bool load1;
- TupleDesc tupdes = RelationGetDescr(wstate->index);
- int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
- SortSupport sortKeys;
- int64 tuples_done = 0;
- bool deduplicate;
-
- deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
- BTGetDeduplicateItems(wstate->index);
-
- if (merge)
- {
- /*
- * Another BTSpool for dead tuples exists. Now we have to merge
- * btspool and btspool2.
- */
-
- /* the preparation of merge */
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = wstate->inskey->scankeys + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- Assert(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- for (;;)
- {
- load1 = true; /* load BTSpool next ? */
- if (itup2 == NULL)
- {
- if (itup == NULL)
- break;
- }
- else if (itup != NULL)
- {
- int32 compare = 0;
-
- for (i = 1; i <= keysz; i++)
- {
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
- if (compare > 0)
- {
- load1 = false;
- break;
- }
- else if (compare < 0)
- break;
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is
- * required for btree indexes, since heap TID is treated as an
- * implicit last key attribute in order to ensure that all
- * keys in the index are physically unique.
- */
- if (compare == 0)
- {
- compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
- Assert(compare != 0);
- if (compare > 0)
- load1 = false;
- }
- }
- else
- load1 = false;
-
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- if (load1)
- {
- _bt_buildadd(wstate, state, itup, 0);
- itup = tuplesort_getindextuple(btspool->sortstate, true);
- }
- else
- {
- _bt_buildadd(wstate, state, itup2, 0);
- itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- pfree(sortKeys);
- }
- else if (deduplicate)
- {
- /* merge is unnecessary, deduplicate into posting lists */
- BTDedupState dstate;
-
- dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
- dstate->deduplicate = true; /* unused */
- dstate->nmaxitems = 0; /* unused */
- dstate->maxpostingsize = 0; /* set later */
- /* Metadata about base tuple of current pending posting list */
- dstate->base = NULL;
- dstate->baseoff = InvalidOffsetNumber; /* unused */
- dstate->basetupsize = 0;
- /* Metadata about current pending posting list TIDs */
- dstate->htids = NULL;
- dstate->nhtids = 0;
- dstate->nitems = 0;
- dstate->phystupsize = 0; /* unused */
- dstate->nintervals = 0; /* unused */
-
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- {
- state = _bt_pagestate(wstate, 0);
-
- /*
- * Limit size of posting list tuples to 1/10 space we want to
- * leave behind on the page, plus space for final item's line
- * pointer. This is equal to the space that we'd like to
- * leave behind on each leaf page when fillfactor is 90,
- * allowing us to get close to fillfactor% space utilization
- * when there happen to be a great many duplicates. (This
- * makes higher leaf fillfactor settings ineffective when
- * building indexes that have many duplicates, but packing
- * leaf pages full with few very large tuples doesn't seem
- * like a useful goal.)
- */
- dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
- sizeof(ItemIdData);
- Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
- dstate->maxpostingsize <= INDEX_SIZE_MASK);
- dstate->htids = palloc(dstate->maxpostingsize);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
- else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
- _bt_dedup_save_htid(dstate, itup))
- {
- /*
- * Tuple is equal to base tuple of pending posting list. Heap
- * TID from itup has been saved in state.
- */
- }
- else
- {
- /*
- * Tuple is not equal to pending posting list tuple, or
- * _bt_dedup_save_htid() opted to not merge current item into
- * pending posting list.
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
-
- /* start new pending posting list with itup copy */
- _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
- InvalidOffsetNumber);
- }
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
-
- if (state)
- {
- /*
- * Handle the last item (there must be a last item when the
- * tuplesort returned one or more tuples)
- */
- _bt_sort_dedup_finish_pending(wstate, state, dstate);
- pfree(dstate->base);
- pfree(dstate->htids);
- }
-
- pfree(dstate);
- }
- else
- {
- /* merging and deduplication are both unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
- {
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
-
- _bt_buildadd(wstate, state, itup, 0);
-
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
- }
- }
-
- /* Close down final pages and write the metapage */
- _bt_uppershutdown(wstate, state);
-
- /*
- * When we WAL-logged index pages, we must nonetheless fsync index files.
- * Since we're building outside shared buffers, a CHECKPOINT occurring
- * during the build has no way to flush the previously written data to
- * disk (indeed it won't know the index even exists). A crash later on
- * would replay WAL from the checkpoint, therefore it wouldn't replay our
- * earlier WAL entries. If we do not fsync those pages here, they might
- * still not be on disk when the crash occurs.
- */
- if (wstate->btws_use_wal)
- smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
-}
-
/*
* Create parallel context, and launch workers for leader.
*
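The NBT_SPECIALIZE_FILE / nbts_prep_ctx() machinery used in the hunks above comes from patch 0002 and its access/nbtree_spec.h header, which is not part of this patch. To give an idea of the general technique (and only that; the real header is more involved and its details may differ), here is a single-file sketch of specializing a function per key shape through token pasting, plus a per-call dispatch step:

/*
 * shape_spec.c -- single-file sketch of per-key-shape specialization via
 * token pasting plus runtime dispatch.  Technique illustration only; the
 * patch's access/nbtree_spec.h (patch 0002) is not reproduced here and
 * differs in its details.
 */
#include <stdio.h>

typedef enum KeyShape
{
    SHAPE_SINGLE_KEY,       /* one key column */
    SHAPE_CACHED_OFFSETS    /* multiple key columns, offsets cacheable */
} KeyShape;

/* Build a per-shape function name, e.g. get_attr_single_key() */
#define SPEC_NAME(fn, shape) fn##_##shape

/*
 * "Template": the same body is stamped out once per shape, with only the
 * attribute-access expression differing.  The patch plays a comparable
 * trick by re-including a *_spec.c file under different macro definitions.
 */
#define DEFINE_GET_ATTR(shape, ACCESS_EXPR) \
static int \
SPEC_NAME(get_attr, shape)(const int *tuple, int attno) \
{ \
    (void) attno;   /* unused for some shapes */ \
    return (ACCESS_EXPR); \
}

DEFINE_GET_ATTR(single_key, tuple[0])
DEFINE_GET_ATTR(cached_offsets, tuple[attno - 1])

/*
 * Per-call dispatch on the index's key shape; loosely what nbts_prep_ctx()
 * and the specialized call macros presumably arrange in the patch.
 */
static int
get_attr(KeyShape shape, const int *tuple, int attno)
{
    switch (shape)
    {
        case SHAPE_SINGLE_KEY:
            return SPEC_NAME(get_attr, single_key)(tuple, attno);
        case SHAPE_CACHED_OFFSETS:
        default:
            return SPEC_NAME(get_attr, cached_offsets)(tuple, attno);
    }
}

int
main(void)
{
    int     tuple[] = {42, 7, 9};

    printf("%d %d\n",
           get_attr(SHAPE_SINGLE_KEY, tuple, 1),
           get_attr(SHAPE_CACHED_OFFSETS, tuple, 3));
    return 0;
}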
diff --git a/src/backend/access/nbtree/nbtsort_spec.c b/src/backend/access/nbtree/nbtsort_spec.c
new file mode 100644
index 0000000000..368d6f244c
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsort_spec.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsort_spec.c
+ * Index shape-specialized functions for nbtsort.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtsort_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_load NBTS_FUNCTION(_bt_load)
+
+static void _bt_load(BTWriteState *wstate,
+ BTSpool *btspool, BTSpool *btspool2);
+
+/*
+ * Read tuples in correct sort order from tuplesort, and load them into
+ * btree leaves.
+ */
+static void
+_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
+{
+ BTPageState *state = NULL;
+ bool merge = (btspool2 != NULL);
+ IndexTuple itup,
+ itup2 = NULL;
+ bool load1;
+ TupleDesc tupdes = RelationGetDescr(wstate->index);
+ int i,
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ SortSupport sortKeys;
+ int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+ BTGetDeduplicateItems(wstate->index);
+
+ if (merge)
+ {
+ /*
+ * Another BTSpool for dead tuples exists. Now we have to merge
+ * btspool and btspool2.
+ */
+
+ /* the preparation of merge */
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+
+ /* Prepare SortSupport data for each column */
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = wstate->inskey->scankeys + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ Assert(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ for (;;)
+ {
+ load1 = true; /* load BTSpool next ? */
+ if (itup2 == NULL)
+ {
+ if (itup == NULL)
+ break;
+ }
+ else if (itup != NULL)
+ {
+ int32 compare = 0;
+
+ for (i = 1; i <= keysz; i++)
+ {
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+ if (compare > 0)
+ {
+ load1 = false;
+ break;
+ }
+ else if (compare < 0)
+ break;
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is
+ * required for btree indexes, since heap TID is treated as an
+ * implicit last key attribute in order to ensure that all
+ * keys in the index are physically unique.
+ */
+ if (compare == 0)
+ {
+ compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+ Assert(compare != 0);
+ if (compare > 0)
+ load1 = false;
+ }
+ }
+ else
+ load1 = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (load1)
+ {
+ _bt_buildadd(wstate, state, itup, 0);
+ itup = tuplesort_getindextuple(btspool->sortstate, true);
+ }
+ else
+ {
+ _bt_buildadd(wstate, state, itup2, 0);
+ itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ pfree(sortKeys);
+ }
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->deduplicate = true; /* unused */
+ dstate->nmaxitems = 0; /* unused */
+ dstate->maxpostingsize = 0; /* set later */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ /* Metadata about current pending posting list TIDs */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->phystupsize = 0; /* unused */
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to 1/10 space we want to
+ * leave behind on the page, plus space for final item's line
+ * pointer. This is equal to the space that we'd like to
+ * leave behind on each leaf page when fillfactor is 90,
+ * allowing us to get close to fillfactor% space utilization
+ * when there happen to be a great many duplicates. (This
+ * makes higher leaf fillfactor settings ineffective when
+ * building indexes that have many duplicates, but packing
+ * leaf pages full with few very large tuples doesn't seem
+ * like a useful goal.)
+ */
+ dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+ sizeof(ItemIdData);
+ Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+ dstate->maxpostingsize <= INDEX_SIZE_MASK);
+ dstate->htids = palloc(dstate->maxpostingsize);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID from itup has been saved in state.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list.
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+
+ /* start new pending posting list with itup copy */
+ _bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+ InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ if (state)
+ {
+ /*
+ * Handle the last item (there must be a last item when the
+ * tuplesort returned one or more tuples)
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
+ else
+ {
+ /* merging and deduplication are both unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup, 0);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+
+ /* Close down final pages and write the metapage */
+ _bt_uppershutdown(wstate, state);
+
+ /*
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
+ */
+ if (wstate->btws_use_wal)
+ smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+}
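For completeness: the merge branch of _bt_load() above is an ordinary two-way merge whose tiebreaker is the heap TID, which nbtree treats as an implicit last key column. The standalone sketch below (illustrative names only, not patch code) shows the same control flow on simplified tuples:

/*
 * spool_merge.c -- standalone illustration of the two-spool merge loop in
 * _bt_load(): take from whichever input has the smaller next key, breaking
 * exact ties on a TID-like column.  Illustration only; not patch code.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct SpoolTuple
{
    int     key;            /* stands in for the (multi-column) key */
    int     tid;            /* stands in for the heap TID tiebreaker */
} SpoolTuple;

/* Emit both inputs in key order; equal keys are ordered by tid. */
static void
merge_spools(const SpoolTuple *a, int na, const SpoolTuple *b, int nb)
{
    int     i = 0,
            j = 0;

    while (i < na || j < nb)
    {
        bool    load1 = true;   /* take the next tuple from "a"? */

        if (i >= na)
            load1 = false;
        else if (j < nb)
        {
            int     compare = (a[i].key > b[j].key) - (a[i].key < b[j].key);

            /* keys equal: fall back to the tiebreaker, as _bt_load() does */
            if (compare == 0)
                compare = (a[i].tid > b[j].tid) - (a[i].tid < b[j].tid);
            if (compare > 0)
                load1 = false;
        }

        if (load1)
        {
            printf("(%d,%d) ", a[i].key, a[i].tid);
            i++;
        }
        else
        {
            printf("(%d,%d) ", b[j].key, b[j].tid);
            j++;
        }
    }
    printf("\n");
}

int
main(void)
{
    SpoolTuple  spool1[] = {{1, 3}, {2, 1}, {4, 2}};
    SpoolTuple  spool2[] = {{2, 2}, {3, 1}};

    /* prints: (1,3) (2,1) (2,2) (3,1) (4,2) */
    merge_spools(spool1, 3, spool2, 2);
    return 0;
}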
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 43b67893d9..db2da1e303 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -639,6 +639,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
ItemId itemid;
IndexTuple tup;
int keepnatts;
+ nbts_prep_ctx(state->rel);
Assert(state->is_leaf && !state->is_rightmost);
@@ -945,6 +946,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
*rightinterval;
int perfectpenalty;
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+ nbts_prep_ctx(state->rel);
/* Assume that alternative strategy won't be used for now */
*strategy = SPLIT_DEFAULT;
@@ -1137,6 +1139,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
{
IndexTuple lastleft;
IndexTuple firstright;
+ nbts_prep_ctx(state->rel);
if (!state->is_leaf)
{
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7da499c4dd..37d644e9f3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -50,130 +50,10 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
-static bool _bt_check_rowcompare(ScanKey skey,
- IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
+#define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.c"
+#include "access/nbtree_spec.h"
-/*
- * _bt_mkscankey
- * Build an insertion scan key that contains comparison data from itup
- * as well as comparator routines appropriate to the key datatypes.
- *
- * When itup is a non-pivot tuple, the returned insertion scan key is
- * suitable for finding a place for it to go on the leaf level. Pivot
- * tuples can be used to re-find leaf page with matching high key, but
- * then caller needs to set scan key's pivotsearch field to true. This
- * allows caller to search for a leaf page with a matching high key,
- * which is usually to the left of the first leaf page a non-pivot match
- * might appear on.
- *
- * The result is intended for use with _bt_compare() and _bt_truncate().
- * Callers that don't need to fill out the insertion scankey arguments
- * (e.g. they use an ad-hoc comparison routine, or only need a scankey
- * for _bt_truncate()) can pass a NULL index tuple. The scankey will
- * be initialized as if an "all truncated" pivot tuple was passed
- * instead.
- *
- * Note that we may occasionally have to share lock the metapage to
- * determine whether or not the keys in the index are expected to be
- * unique (i.e. if this is a "heapkeyspace" index). We assume a
- * heapkeyspace index when caller passes a NULL tuple, allowing index
- * build callers to avoid accessing the non-existent metapage. We
- * also assume that the index is _not_ allequalimage when a NULL tuple
- * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
- * field themselves.
- */
-BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
-{
- BTScanInsert key;
- ScanKey skey;
- TupleDesc itupdesc;
- int indnkeyatts;
- int16 *indoption;
- int tupnatts;
- int i;
-
- itupdesc = RelationGetDescr(rel);
- indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- indoption = rel->rd_indoption;
- tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
-
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
- /*
- * We'll execute search using scan key constructed on key columns.
- * Truncated attributes and non-key attributes are omitted from the final
- * scan key.
- */
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
- if (itup)
- _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
- else
- {
- /* Utility statement callers can set these fields themselves */
- key->heapkeyspace = true;
- key->allequalimage = false;
- }
- key->anynullkeys = false; /* initial assumption */
- key->nextkey = false;
- key->pivotsearch = false;
- key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
- skey = key->scankeys;
- for (i = 0; i < indnkeyatts; i++)
- {
- FmgrInfo *procinfo;
- Datum arg;
- bool null;
- int flags;
-
- /*
- * We can use the cached (default) support procs since no cross-type
- * comparison can be needed.
- */
- procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-
- /*
- * Key arguments built from truncated attributes (or when caller
- * provides no tuple) are defensively represented as NULL values. They
- * should never be used.
- */
- if (i < tupnatts)
- arg = index_getattr(itup, i + 1, itupdesc, &null);
- else
- {
- arg = (Datum) 0;
- null = true;
- }
- flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
- ScanKeyEntryInitializeWithInfo(&skey[i],
- flags,
- (AttrNumber) (i + 1),
- InvalidStrategy,
- InvalidOid,
- rel->rd_indcollation[i],
- procinfo,
- arg);
- /* Record if any key attribute is NULL (or truncated) */
- if (null)
- key->anynullkeys = true;
- }
-
- /*
- * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
- * that full uniqueness check is done.
- */
- if (rel->rd_index->indnullsnotdistinct)
- key->anynullkeys = false;
-
- return key;
-}
/*
* free a retracement stack made by _bt_search.
@@ -1340,356 +1220,6 @@ _bt_mark_scankey_required(ScanKey skey)
}
}
-/*
- * Test whether an indextuple satisfies all the scankey conditions.
- *
- * Return true if so, false if not. If the tuple fails to pass the qual,
- * we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
- * _bt_preprocess_keys(), above, about how this is done.
- *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
- *
- * scan: index scan descriptor (containing a search-type scankey)
- * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
- */
-bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
-{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
- int ikey;
- ScanKey key;
-
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
- *continuescan = true; /* default assumption */
-
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
- {
- Datum datum;
- bool isNull;
- Datum test;
-
- if (key->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- continue;
- }
-
- /* row-comparison keys need special processing */
- if (key->sk_flags & SK_ROW_HEADER)
- {
- if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
- continue;
- return false;
- }
-
- datum = index_getattr(tuple,
- key->sk_attno,
- tupdesc,
- &isNull);
-
- if (key->sk_flags & SK_ISNULL)
- {
- /* Handle IS NULL/NOT NULL tests */
- if (key->sk_flags & SK_SEARCHNULL)
- {
- if (isNull)
- continue; /* tuple satisfies this qual */
- }
- else
- {
- Assert(key->sk_flags & SK_SEARCHNOTNULL);
- if (!isNull)
- continue; /* tuple satisfies this qual */
- }
-
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (isNull)
- {
- if (key->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
-
- if (!DatumGetBool(test))
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will
- * pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
- */
- if ((key->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((key->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
- }
-
- /* If we get here, the tuple passes all index quals. */
- return true;
-}
-
-/*
- * Test whether an indextuple satisfies a row-comparison scan condition.
- *
- * Return true if so, false if not. If not, also clear *continuescan if
- * it's not possible for any future tuples in the current scan direction
- * to pass the qual.
- *
- * This is a subroutine for _bt_checkkeys, which see for more info.
- */
-static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
-{
- ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
- int32 cmpresult = 0;
- bool result;
-
- /* First subkey should be same as the header says */
- Assert(subkey->sk_attno == skey->sk_attno);
-
- /* Loop over columns of the row condition */
- for (;;)
- {
- Datum datum;
- bool isNull;
-
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
-
- if (subkey->sk_attno > tupnatts)
- {
- /*
- * This attribute is truncated (must be high key). The value for
- * this attribute in the first non-pivot tuple on the page to the
- * right could be any possible value. Assume that truncated
- * attribute passes the qual.
- */
- Assert(ScanDirectionIsForward(dir));
- Assert(BTreeTupleIsPivot(tuple));
- cmpresult = 0;
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- continue;
- }
-
- datum = index_getattr(tuple,
- subkey->sk_attno,
- tupdesc,
- &isNull);
-
- if (isNull)
- {
- if (subkey->sk_flags & SK_BT_NULLS_FIRST)
- {
- /*
- * Since NULLs are sorted before non-NULLs, we know we have
- * reached the lower limit of the range of values for this
- * index attr. On a backward scan, we can stop if this qual
- * is one of the "must match" subset. We can stop regardless
- * of whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a forward scan, however, we must keep going, because we may
- * have initially positioned to the start of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
- else
- {
- /*
- * Since NULLs are sorted after non-NULLs, we know we have
- * reached the upper limit of the range of values for this
- * index attr. On a forward scan, we can stop if this qual is
- * one of the "must match" subset. We can stop regardless of
- * whether the qual is > or <, so long as it's required,
- * because it's not possible for any future tuples to pass. On
- * a backward scan, however, we must keep going, because we
- * may have initially positioned to the end of the index.
- */
- if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- }
-
- /*
- * In any case, this indextuple doesn't match the qual.
- */
- return false;
- }
-
- if (subkey->sk_flags & SK_ISNULL)
- {
- /*
- * Unlike the simple-scankey case, this isn't a disallowed case.
- * But it can never match. If all the earlier row comparison
- * columns are required for the scan direction, we can stop the
- * scan, because there can't be another tuple that will succeed.
- */
- if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
- subkey--;
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- return false;
- }
-
- /* Perform the test --- three-way comparison not bool operator */
- cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
- subkey->sk_collation,
- datum,
- subkey->sk_argument));
-
- if (subkey->sk_flags & SK_BT_DESC)
- INVERT_COMPARE_RESULT(cmpresult);
-
- /* Done comparing if unequal, else advance to next column */
- if (cmpresult != 0)
- break;
-
- if (subkey->sk_flags & SK_ROW_END)
- break;
- subkey++;
- }
-
- /*
- * At this point cmpresult indicates the overall result of the row
- * comparison, and subkey points to the deciding column (or the last
- * column if the result is "=").
- */
- switch (subkey->sk_strategy)
- {
- /* EQ and NE cases aren't allowed here */
- case BTLessStrategyNumber:
- result = (cmpresult < 0);
- break;
- case BTLessEqualStrategyNumber:
- result = (cmpresult <= 0);
- break;
- case BTGreaterEqualStrategyNumber:
- result = (cmpresult >= 0);
- break;
- case BTGreaterStrategyNumber:
- result = (cmpresult > 0);
- break;
- default:
- elog(ERROR, "unrecognized RowCompareType: %d",
- (int) subkey->sk_strategy);
- result = 0; /* keep compiler quiet */
- break;
- }
-
- if (!result)
- {
- /*
- * Tuple fails this qual. If it's a required qual for the current
- * scan direction, then we can conclude no further tuples will pass,
- * either. Note we have to look at the deciding column, not
- * necessarily the first or last column of the row condition.
- */
- if ((subkey->sk_flags & SK_BT_REQFWD) &&
- ScanDirectionIsForward(dir))
- *continuescan = false;
- else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
- ScanDirectionIsBackward(dir))
- *continuescan = false;
- }
-
- return result;
-}
-
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
@@ -2173,286 +1703,6 @@ btbuildphasename(int64 phasenum)
}
}
-/*
- * _bt_truncate() -- create tuple without unneeded suffix attributes.
- *
- * Returns truncated pivot index tuple allocated in caller's memory context,
- * with key attributes copied from caller's firstright argument. If rel is
- * an INCLUDE index, non-key attributes will definitely be truncated away,
- * since they're not part of the key space. More aggressive suffix
- * truncation can take place when it's clear that the returned tuple does not
- * need one or more suffix key attributes. We only need to keep firstright
- * attributes up to and including the first non-lastleft-equal attribute.
- * Caller's insertion scankey is used to compare the tuples; the scankey's
- * argument values are not considered here.
- *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented. Caller
- * should only change truncated tuple's downlink. Note also that truncated
- * key attributes are treated as containing "minus infinity" values by
- * _bt_compare().
- *
- * In the worst case (when a heap TID must be appended to distinguish lastleft
- * from firstright), the size of the returned tuple is the size of firstright
- * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
- * is important, since callers need to stay under the 1/3 of a page
- * restriction on tuple size. If this routine is ever taught to truncate
- * within an attribute/datum, it will need to avoid returning an enlarged
- * tuple to caller when truncation + TOAST compression ends up enlarging the
- * final datum.
- */
-IndexTuple
-_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
- IndexTuple pivot;
- IndexTuple tidpivot;
- ItemPointer pivotheaptid;
- Size newsize;
-
- /*
- * We should only ever truncate non-pivot tuples from leaf pages. It's
- * never okay to truncate when splitting an internal page.
- */
- Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
-
- /* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
-
-#ifdef DEBUG_NO_TRUNCATE
- /* Force truncation to be ineffective for testing purposes */
- keepnatts = nkeyatts + 1;
-#endif
-
- pivot = index_truncate_tuple(itupdesc, firstright,
- Min(keepnatts, nkeyatts));
-
- if (BTreeTupleIsPosting(pivot))
- {
- /*
- * index_truncate_tuple() just returns a straight copy of firstright
- * when it has no attributes to truncate. When that happens, we may
- * need to truncate away a posting list here instead.
- */
- Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
- Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
- }
-
- /*
- * If there is a distinguishing key attribute within pivot tuple, we're
- * done
- */
- if (keepnatts <= nkeyatts)
- {
- BTreeTupleSetNAtts(pivot, keepnatts, false);
- return pivot;
- }
-
- /*
- * We have to store a heap TID in the new pivot tuple, since no non-TID
- * key attribute value in firstright distinguishes the right side of the
- * split from the left side. nbtree conceptualizes this case as an
- * inability to truncate away any key attributes, since heap TID is
- * treated as just another key attribute (despite lacking a pg_attribute
- * entry).
- *
- * Use enlarged space that holds a copy of pivot. We need the extra space
- * to store a heap TID at the end (using the special pivot tuple
- * representation). Note that the original pivot already has firstright's
- * possible posting list/non-key attribute values removed at this point.
- */
- newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
- tidpivot = palloc0(newsize);
- memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
- /* Cannot leak memory here */
- pfree(pivot);
-
- /*
- * Store all of firstright's key attribute values plus a tiebreaker heap
- * TID value in enlarged pivot tuple
- */
- tidpivot->t_info &= ~INDEX_SIZE_MASK;
- tidpivot->t_info |= newsize;
- BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
- pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
-
- /*
- * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
- * consider suffix truncation. It seems like a good idea to follow that
- * example in cases where no truncation takes place -- use lastleft's heap
- * TID. (This is also the closest value to negative infinity that's
- * legally usable.)
- */
- ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
-
- /*
- * We're done. Assert() that heap TID invariants hold before returning.
- *
- * Lehman and Yao require that the downlink to the right page, which is to
- * be inserted into the parent page in the second phase of a page split be
- * a strict lower bound on items on the right page, and a non-strict upper
- * bound for items on the left page. Assert that heap TIDs follow these
- * invariants, since a heap TID value is apparently needed as a
- * tiebreaker.
- */
-#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
- BTreeTupleGetHeapTID(firstright)) < 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(lastleft)) >= 0);
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#else
-
- /*
- * Those invariants aren't guaranteed to hold for lastleft + firstright
- * heap TID attribute values when they're considered here only because
- * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
- * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
- * TID value that always works as a strict lower bound for items to the
- * right. In particular, it must avoid using firstright's leading key
- * attribute values along with lastleft's heap TID value when lastleft's
- * TID happens to be greater than firstright's TID.
- */
- ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
-
- /*
- * Pivot heap TID should never be fully equal to firstright. Note that
- * the pivot heap TID will still end up equal to lastleft's heap TID when
- * that's the only usable value.
- */
- ItemPointerSetOffsetNumber(pivotheaptid,
- OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid,
- BTreeTupleGetHeapTID(firstright)) < 0);
-#endif
-
- return tidpivot;
-}
-
-/*
- * _bt_keep_natts - how many key attributes to keep when truncating.
- *
- * Caller provides two tuples that enclose a split point. Caller's insertion
- * scankey is used to compare the tuples; the scankey's argument values are
- * not considered here.
- *
- * This can return a number of attributes that is one greater than the
- * number of key attributes for the index relation. This indicates that the
- * caller must use a heap TID as a unique-ifier in new pivot tuple.
- */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
-{
- int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keepnatts;
- ScanKey scankey;
-
- /*
- * _bt_compare() treats truncated key attributes as having the value minus
- * infinity, which would break searches within !heapkeyspace indexes. We
- * must still truncate away non-key attribute values, though.
- */
- if (!itup_key->heapkeyspace)
- return nkeyatts;
-
- scankey = itup_key->scankeys;
- keepnatts = 1;
- for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
- scankey->sk_collation,
- datum1,
- datum2)) != 0)
- break;
-
- keepnatts++;
- }
-
- /*
- * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
- * expected in an allequalimage index.
- */
- Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
-
- return keepnatts;
-}
-
-/*
- * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
- *
- * This is exported so that a candidate split point can have its effect on
- * suffix truncation inexpensively evaluated ahead of time when finding a
- * split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
- *
- * The approach taken here usually provides the same answer as _bt_keep_natts
- * will (for the same pair of tuples from a heapkeyspace index), since the
- * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting. When an index only has
- * "equal image" columns, routine is guaranteed to give the same result as
- * _bt_keep_natts would.
- *
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts, even when the index uses
- * an opclass or collation that is not "allequalimage"/deduplication-safe.
- * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
- * negatives generally only have the effect of making leaf page splits use a
- * more balanced split point.
- */
-int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
-{
- TupleDesc itupdesc = RelationGetDescr(rel);
- int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
- int keepnatts;
-
- keepnatts = 1;
- for (int attnum = 1; attnum <= keysz; attnum++)
- {
- Datum datum1,
- datum2;
- bool isNull1,
- isNull2;
- Form_pg_attribute att;
-
- datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
- datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
- att = TupleDescAttr(itupdesc, attnum - 1);
-
- if (isNull1 != isNull2)
- break;
-
- if (!isNull1 &&
- !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
- break;
-
- keepnatts++;
- }
-
- return keepnatts;
-}
-
/*
* _bt_check_natts() -- Verify tuple has expected number of attributes.
*
diff --git a/src/backend/access/nbtree/nbtutils_spec.c b/src/backend/access/nbtree/nbtutils_spec.c
new file mode 100644
index 0000000000..0288da22d6
--- /dev/null
+++ b/src/backend/access/nbtree/nbtutils_spec.c
@@ -0,0 +1,775 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtutils_spec.c
+ * Index shape-specialized functions for nbtutils.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtutils_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define _bt_check_rowcompare NBTS_FUNCTION(_bt_check_rowcompare)
+#define _bt_keep_natts NBTS_FUNCTION(_bt_keep_natts)
+
+static bool _bt_check_rowcompare(ScanKey skey,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+
+
+/*
+ * _bt_mkscankey
+ * Build an insertion scan key that contains comparison data from itup
+ * as well as comparator routines appropriate to the key datatypes.
+ *
+ * When itup is a non-pivot tuple, the returned insertion scan key is
+ * suitable for finding a place for it to go on the leaf level. Pivot
+ * tuples can be used to re-find leaf page with matching high key, but
+ * then caller needs to set scan key's pivotsearch field to true. This
+ * allows caller to search for a leaf page with a matching high key,
+ * which is usually to the left of the first leaf page a non-pivot match
+ * might appear on.
+ *
+ * The result is intended for use with _bt_compare() and _bt_truncate().
+ * Callers that don't need to fill out the insertion scankey arguments
+ * (e.g. they use an ad-hoc comparison routine, or only need a scankey
+ * for _bt_truncate()) can pass a NULL index tuple. The scankey will
+ * be initialized as if an "all truncated" pivot tuple was passed
+ * instead.
+ *
+ * Note that we may occasionally have to share lock the metapage to
+ * determine whether or not the keys in the index are expected to be
+ * unique (i.e. if this is a "heapkeyspace" index). We assume a
+ * heapkeyspace index when caller passes a NULL tuple, allowing index
+ * build callers to avoid accessing the non-existent metapage. We
+ * also assume that the index is _not_ allequalimage when a NULL tuple
+ * is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ * field themselves.
+ */
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup)
+{
+ BTScanInsert key;
+ ScanKey skey;
+ TupleDesc itupdesc;
+ int indnkeyatts;
+ int16 *indoption;
+ int tupnatts;
+ int i;
+
+ itupdesc = RelationGetDescr(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ indoption = rel->rd_indoption;
+ tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+
+ /*
+ * We'll execute search using scan key constructed on key columns.
+ * Truncated attributes and non-key attributes are omitted from the final
+ * scan key.
+ */
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+ if (itup)
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+ else
+ {
+ /* Utility statement callers can set these fields themselves */
+ key->heapkeyspace = true;
+ key->allequalimage = false;
+ }
+ key->anynullkeys = false; /* initial assumption */
+ key->nextkey = false;
+ key->pivotsearch = false;
+ key->keysz = Min(indnkeyatts, tupnatts);
+ key->scantid = key->heapkeyspace && itup ?
+ BTreeTupleGetHeapTID(itup) : NULL;
+ skey = key->scankeys;
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ FmgrInfo *procinfo;
+ Datum arg;
+ bool null;
+ int flags;
+
+ /*
+ * We can use the cached (default) support procs since no cross-type
+ * comparison can be needed.
+ */
+ procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
+
+ /*
+ * Key arguments built from truncated attributes (or when caller
+ * provides no tuple) are defensively represented as NULL values. They
+ * should never be used.
+ */
+ if (i < tupnatts)
+ arg = index_getattr(itup, i + 1, itupdesc, &null);
+ else
+ {
+ arg = (Datum) 0;
+ null = true;
+ }
+ flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+ ScanKeyEntryInitializeWithInfo(&skey[i],
+ flags,
+ (AttrNumber) (i + 1),
+ InvalidStrategy,
+ InvalidOid,
+ rel->rd_indcollation[i],
+ procinfo,
+ arg);
+ /* Record if any key attribute is NULL (or truncated) */
+ if (null)
+ key->anynullkeys = true;
+ }
+
+ /*
+ * In NULLS NOT DISTINCT mode, we pretend that there are no null keys, so
+ * that full uniqueness check is done.
+ */
+ if (rel->rd_index->indnullsnotdistinct)
+ key->anynullkeys = false;
+
+ return key;
+}
+
+/*
+ * Test whether an indextuple satisfies all the scankey conditions.
+ *
+ * Return true if so, false if not. If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly. See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
+ *
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
+ * dir: direction we are scanning in
+ * continuescan: output parameter (will be set correctly in all cases)
+ */
+bool
+_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ keysz = so->numberOfKeys;
+
+ for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ Datum test;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ if (key->sk_flags & SK_ROW_HEADER)
+ {
+ if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+ continuescan))
+ continue;
+ return false;
+ }
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
+
+ if (!DatumGetBool(test))
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return true;
+}
+
+/*
+ * Test whether an indextuple satisfies a row-comparison scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction
+ * to pass the qual.
+ *
+ * This is a subroutine for _bt_checkkeys, which see for more info.
+ */
+static bool
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+ TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
+{
+ ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+ int32 cmpresult = 0;
+ bool result;
+
+ /* First subkey should be same as the header says */
+ Assert(subkey->sk_attno == skey->sk_attno);
+
+ /* Loop over columns of the row condition */
+ for (;;)
+ {
+ Datum datum;
+ bool isNull;
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+
+ if (subkey->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
+ cmpresult = 0;
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ continue;
+ }
+
+ datum = index_getattr(tuple,
+ subkey->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (isNull)
+ {
+ if (subkey->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ return false;
+ }
+
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Unlike the simple-scankey case, this isn't a disallowed case.
+ * But it can never match. If all the earlier row comparison
+ * columns are required for the scan direction, we can stop the
+ * scan, because there can't be another tuple that will succeed.
+ */
+ if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
+ subkey--;
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ return false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
+ subkey->sk_collation,
+ datum,
+ subkey->sk_argument));
+
+ if (subkey->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ /* Done comparing if unequal, else advance to next column */
+ if (cmpresult != 0)
+ break;
+
+ if (subkey->sk_flags & SK_ROW_END)
+ break;
+ subkey++;
+ }
+
+ /*
+ * At this point cmpresult indicates the overall result of the row
+ * comparison, and subkey points to the deciding column (or the last
+ * column if the result is "=").
+ */
+ switch (subkey->sk_strategy)
+ {
+ /* EQ and NE cases aren't allowed here */
+ case BTLessStrategyNumber:
+ result = (cmpresult < 0);
+ break;
+ case BTLessEqualStrategyNumber:
+ result = (cmpresult <= 0);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ result = (cmpresult >= 0);
+ break;
+ case BTGreaterStrategyNumber:
+ result = (cmpresult > 0);
+ break;
+ default:
+ elog(ERROR, "unrecognized RowCompareType: %d",
+ (int) subkey->sk_strategy);
+ result = 0; /* keep compiler quiet */
+ break;
+ }
+
+ if (!result)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will pass,
+ * either. Note we have to look at the deciding column, not
+ * necessarily the first or last column of the row condition.
+ */
+ if ((subkey->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ *continuescan = false;
+ else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *continuescan = false;
+ }
+
+ return result;
+}
+
+/*
+ * _bt_truncate() -- create tuple without unneeded suffix attributes.
+ *
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument. If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space. More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes. We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
+ *
+ * Note that returned tuple's t_tid offset will hold the number of attributes
+ * present, so the original item pointer offset is not represented. Caller
+ * should only change truncated tuple's downlink. Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID must be appended to distinguish lastleft
+ * from firstright), the size of the returned tuple is the size of firstright
+ * plus the size of an additional MAXALIGN()'d item pointer. This guarantee
+ * is important, since callers need to stay under the 1/3 of a page
+ * restriction on tuple size. If this routine is ever taught to truncate
+ * within an attribute/datum, it will need to avoid returning an enlarged
+ * tuple to caller when truncation + TOAST compression ends up enlarging the
+ * final datum.
+ */
+IndexTuple
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+ IndexTuple pivot;
+ IndexTuple tidpivot;
+ ItemPointer pivotheaptid;
+ Size newsize;
+
+ /*
+ * We should only ever truncate non-pivot tuples from leaf pages. It's
+ * never okay to truncate when splitting an internal page.
+ */
+ Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
+
+ /* Determine how many attributes must be kept in truncated tuple */
+ keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+ /* Force truncation to be ineffective for testing purposes */
+ keepnatts = nkeyatts + 1;
+#endif
+
+ pivot = index_truncate_tuple(itupdesc, firstright,
+ Min(keepnatts, nkeyatts));
+
+ if (BTreeTupleIsPosting(pivot))
+ {
+ /*
+ * index_truncate_tuple() just returns a straight copy of firstright
+ * when it has no attributes to truncate. When that happens, we may
+ * need to truncate away a posting list here instead.
+ */
+ Assert(keepnatts == nkeyatts || keepnatts == nkeyatts + 1);
+ Assert(IndexRelationGetNumberOfAttributes(rel) == nkeyatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+
+ /*
+ * If there is a distinguishing key attribute within pivot tuple, we're
+ * done
+ */
+ if (keepnatts <= nkeyatts)
+ {
+ BTreeTupleSetNAtts(pivot, keepnatts, false);
+ return pivot;
+ }
+
+ /*
+ * We have to store a heap TID in the new pivot tuple, since no non-TID
+ * key attribute value in firstright distinguishes the right side of the
+ * split from the left side. nbtree conceptualizes this case as an
+ * inability to truncate away any key attributes, since heap TID is
+ * treated as just another key attribute (despite lacking a pg_attribute
+ * entry).
+ *
+ * Use enlarged space that holds a copy of pivot. We need the extra space
+ * to store a heap TID at the end (using the special pivot tuple
+ * representation). Note that the original pivot already has firstright's
+ * possible posting list/non-key attribute values removed at this point.
+ */
+ newsize = MAXALIGN(IndexTupleSize(pivot)) + MAXALIGN(sizeof(ItemPointerData));
+ tidpivot = palloc0(newsize);
+ memcpy(tidpivot, pivot, MAXALIGN(IndexTupleSize(pivot)));
+ /* Cannot leak memory here */
+ pfree(pivot);
+
+ /*
+ * Store all of firstright's key attribute values plus a tiebreaker heap
+ * TID value in enlarged pivot tuple
+ */
+ tidpivot->t_info &= ~INDEX_SIZE_MASK;
+ tidpivot->t_info |= newsize;
+ BTreeTupleSetNAtts(tidpivot, nkeyatts, true);
+ pivotheaptid = BTreeTupleGetHeapTID(tidpivot);
+
+ /*
+ * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+ * consider suffix truncation. It seems like a good idea to follow that
+ * example in cases where no truncation takes place -- use lastleft's heap
+ * TID. (This is also the closest value to negative infinity that's
+ * legally usable.)
+ */
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
+
+ /*
+ * We're done. Assert() that heap TID invariants hold before returning.
+ *
+ * Lehman and Yao require that the downlink to the right page, which is to
+ * be inserted into the parent page in the second phase of a page split be
+ * a strict lower bound on items on the right page, and a non-strict upper
+ * bound for items on the left page. Assert that heap TIDs follow these
+ * invariants, since a heap TID value is apparently needed as a
+ * tiebreaker.
+ */
+#ifndef DEBUG_NO_TRUNCATE
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#else
+
+ /*
+ * Those invariants aren't guaranteed to hold for lastleft + firstright
+ * heap TID attribute values when they're considered here only because
+ * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+ * needed as a tiebreaker). DEBUG_NO_TRUNCATE must therefore use a heap
+ * TID value that always works as a strict lower bound for items to the
+ * right. In particular, it must avoid using firstright's leading key
+ * attribute values along with lastleft's heap TID value when lastleft's
+ * TID happens to be greater than firstright's TID.
+ */
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
+
+ /*
+ * Pivot heap TID should never be fully equal to firstright. Note that
+ * the pivot heap TID will still end up equal to lastleft's heap TID when
+ * that's the only usable value.
+ */
+ ItemPointerSetOffsetNumber(pivotheaptid,
+ OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
+#endif
+
+ return tidpivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point. Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation. This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key)
+{
+ int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keepnatts;
+ ScanKey scankey;
+
+ /*
+ * _bt_compare() treats truncated key attributes as having the value minus
+ * infinity, which would break searches within !heapkeyspace indexes. We
+ * must still truncate away non-key attribute values, though.
+ */
+ if (!itup_key->heapkeyspace)
+ return nkeyatts;
+
+ scankey = itup_key->scankeys;
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum1,
+ datum2)) != 0)
+ break;
+
+ keepnatts++;
+ }
+
+ /*
+ * Assert that _bt_keep_natts_fast() agrees with us in passing. This is
+ * expected in an allequalimage index.
+ */
+ Assert(!itup_key->allequalimage ||
+ keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+
+ return keepnatts;
+}
+
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location. A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal after detoasting. When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
+ *
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+ int keepnatts;
+
+ keepnatts = 1;
+ for (int attnum = 1; attnum <= keysz; attnum++)
+ {
+ Datum datum1,
+ datum2;
+ bool isNull1,
+ isNull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+ datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ att = TupleDescAttr(itupdesc, attnum - 1);
+
+ if (isNull1 != isNull2)
+ break;
+
+ if (!isNull1 &&
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
+ break;
+
+ keepnatts++;
+ }
+
+ return keepnatts;
+}
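
For clarity, a _spec.c file like the one above is never compiled on its own:
it is included once per key shape through nbtree_spec.h (added further below).
The corresponding nbtutils.c hunk belongs to an earlier patch in the series and
is not shown here, but by analogy with the tuplesortvariants.c hunk below the
include pattern is expected to look roughly like this:

    /* at the bottom of nbtutils.c (sketch, not part of this hunk) */
    #define NBT_SPECIALIZE_FILE "../../backend/access/nbtree/nbtutils_spec.c"
    #include "access/nbtree_spec.h"

nbtree_spec.h then pulls in nbtutils_spec.c once per specialization, so the
_bt_truncate() above is emitted as _bt_truncate_cached() and
_bt_truncate_default(), and likewise for _bt_mkscankey(), _bt_checkkeys() and
_bt_keep_natts_fast().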
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 84442a93c5..d93839620d 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -61,10 +61,6 @@ static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
SortTuple *stup);
static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
LogicalTape *tape, unsigned int tuplen);
-static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
-static int comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static int comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -140,6 +136,9 @@ typedef struct
int datumTypeLen;
} TuplesortDatumArg;
+#define NBT_SPECIALIZE_FILE "../../backend/utils/sort/tuplesortvariants_spec.c"
+#include "access/nbtree_spec.h"
+
Tuplesortstate *
tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
@@ -228,6 +227,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
MemoryContext oldcontext;
TuplesortClusterArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
@@ -340,6 +340,7 @@ tuplesort_begin_index_btree(Relation heapRel,
TuplesortIndexBTreeArg *arg;
MemoryContext oldcontext;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -475,6 +476,7 @@ tuplesort_begin_index_gist(Relation heapRel,
MemoryContext oldcontext;
TuplesortIndexBTreeArg *arg;
int i;
+ nbts_prep_ctx(indexRel);
oldcontext = MemoryContextSwitchTo(base->maincontext);
arg = (TuplesortIndexBTreeArg *) palloc(sizeof(TuplesortIndexBTreeArg));
@@ -1299,152 +1301,6 @@ removeabbrev_index(Tuplesortstate *state, SortTuple *stups, int count)
}
}
-static int
-comparetup_index_btree(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- /*
- * This is similar to comparetup_heap(), but expects index tuples. There
- * is also special handling for enforcing uniqueness, and special
- * treatment for equal keys at the end.
- */
- TuplesortPublic *base = TuplesortstateGetPublic(state);
- SortSupport sortKey = base->sortKeys;
- int32 compare;
-
- /* Compare the leading sort key */
- compare = ApplySortComparator(a->datum1, a->isnull1,
- b->datum1, b->isnull1,
- sortKey);
- if (compare != 0)
- return compare;
-
- /* Compare additional sort keys */
- return comparetup_index_btree_tiebreak(a, b, state);
-}
-
-static int
-comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state)
-{
- TuplesortPublic *base = TuplesortstateGetPublic(state);
- TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
- SortSupport sortKey = base->sortKeys;
- IndexTuple tuple1;
- IndexTuple tuple2;
- int keysz;
- TupleDesc tupDes;
- bool equal_hasnull = false;
- int nkey;
- int32 compare;
- Datum datum1,
- datum2;
- bool isnull1,
- isnull2;
-
- tuple1 = (IndexTuple) a->tuple;
- tuple2 = (IndexTuple) b->tuple;
- keysz = base->nKeys;
- tupDes = RelationGetDescr(arg->index.indexRel);
-
- if (sortKey->abbrev_converter)
- {
- datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
-
- compare = ApplySortAbbrevFullComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare;
- }
-
- /* they are equal, so we only need to examine one null flag */
- if (a->isnull1)
- equal_hasnull = true;
-
- sortKey++;
- for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
- {
- datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
- datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
-
- compare = ApplySortComparator(datum1, isnull1,
- datum2, isnull2,
- sortKey);
- if (compare != 0)
- return compare; /* done when we find unequal attributes */
-
- /* they are equal, so we only need to examine one null flag */
- if (isnull1)
- equal_hasnull = true;
- }
-
- /*
- * If btree has asked us to enforce uniqueness, complain if two equal
- * tuples are detected (unless there was at least one NULL field and NULLS
- * NOT DISTINCT was not set).
- *
- * It is sufficient to make the test here, because if two tuples are equal
- * they *must* get compared at some stage of the sort --- otherwise the
- * sort algorithm wouldn't have checked whether one must appear before the
- * other.
- */
- if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
- {
- Datum values[INDEX_MAX_KEYS];
- bool isnull[INDEX_MAX_KEYS];
- char *key_desc;
-
- /*
- * Some rather brain-dead implementations of qsort (such as the one in
- * QNX 4) will sometimes call the comparison routine to compare a
- * value to itself, but we always use our own implementation, which
- * does not.
- */
- Assert(tuple1 != tuple2);
-
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
- }
-
- /*
- * If key values are equal, we sort on ItemPointer. This is required for
- * btree indexes, since heap TID is treated as an implicit last key
- * attribute in order to ensure that all keys in the index are physically
- * unique.
- */
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
- /* ItemPointer values should never be equal */
- Assert(false);
-
- return 0;
-}
-
static int
comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state)
diff --git a/src/backend/utils/sort/tuplesortvariants_spec.c b/src/backend/utils/sort/tuplesortvariants_spec.c
new file mode 100644
index 0000000000..705da09329
--- /dev/null
+++ b/src/backend/utils/sort/tuplesortvariants_spec.c
@@ -0,0 +1,175 @@
+/*-------------------------------------------------------------------------
+ *
+ * tuplesortvariants_spec.c
+ * Index shape-specialized functions for tuplesortvariants.c
+ *
+ * NOTES
+ * See also: access/nbtree/README section "nbtree specialization"
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/sort/tuplesortvariants_spec.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define comparetup_index_btree NBTS_FUNCTION(comparetup_index_btree)
+#define comparetup_index_btree_tiebreak NBTS_FUNCTION(comparetup_index_btree_tiebreak)
+
+static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+static int comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+
+static int
+comparetup_index_btree(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ /*
+ * This is similar to comparetup_heap(), but expects index tuples. There
+ * is also special handling for enforcing uniqueness, and special
+ * treatment for equal keys at the end.
+ */
+ TuplesortPublic *base = TuplesortstateGetPublic(state);
+ SortSupport sortKey = base->sortKeys;
+ int32 compare;
+
+ /* Compare the leading sort key */
+ compare = ApplySortComparator(a->datum1, a->isnull1,
+ b->datum1, b->isnull1,
+ sortKey);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ return comparetup_index_btree_tiebreak(a, b, state);
+}
+
+static int
+comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+	/*
+	 * This is similar to comparetup_heap(), but expects index tuples. There
+	 * is also special handling for enforcing uniqueness, and special
+	 * treatment for equal keys at the end.
+	 */
+ TuplesortPublic *base = TuplesortstateGetPublic(state);
+ TuplesortIndexBTreeArg *arg = (TuplesortIndexBTreeArg *) base->arg;
+ SortSupport sortKey = base->sortKeys;
+ IndexTuple tuple1;
+ IndexTuple tuple2;
+ int keysz;
+ TupleDesc tupDes;
+ bool equal_hasnull = false;
+ int nkey;
+ int32 compare;
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+ tuple1 = (IndexTuple) a->tuple;
+ tuple2 = (IndexTuple) b->tuple;
+ keysz = base->nKeys;
+ tupDes = RelationGetDescr(arg->index.indexRel);
+
+ if (sortKey->abbrev_converter)
+ {
+ datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
+
+ compare = ApplySortAbbrevFullComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare;
+ }
+
+ /* they are equal, so we only need to examine one null flag */
+ if (a->isnull1)
+ equal_hasnull = true;
+
+ sortKey++;
+ for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
+ {
+ datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
+ datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
+
+ compare = ApplySortComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare; /* done when we find unequal attributes */
+
+ /* they are equal, so we only need to examine one null flag */
+ if (isnull1)
+ equal_hasnull = true;
+ }
+
+ /*
+ * If btree has asked us to enforce uniqueness, complain if two equal
+ * tuples are detected (unless there was at least one NULL field and NULLS
+ * NOT DISTINCT was not set).
+ *
+ * It is sufficient to make the test here, because if two tuples are equal
+ * they *must* get compared at some stage of the sort --- otherwise the
+ * sort algorithm wouldn't have checked whether one must appear before the
+ * other.
+ */
+ if (arg->enforceUnique && !(!arg->uniqueNullsNotDistinct && equal_hasnull))
+ {
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ char *key_desc;
+
+ /*
+ * Some rather brain-dead implementations of qsort (such as the one in
+ * QNX 4) will sometimes call the comparison routine to compare a
+ * value to itself, but we always use our own implementation, which
+ * does not.
+ */
+ Assert(tuple1 != tuple2);
+
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
+
+ /*
+ * If key values are equal, we sort on ItemPointer. This is required for
+ * btree indexes, since heap TID is treated as an implicit last key
+ * attribute in order to ensure that all keys in the index are physically
+ * unique.
+ */
+ {
+ BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
+ BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
+
+ if (blk1 != blk2)
+ return (blk1 < blk2) ? -1 : 1;
+ }
+ {
+ OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
+ OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
+
+ if (pos1 != pos2)
+ return (pos1 < pos2) ? -1 : 1;
+ }
+
+ /* ItemPointer values should never be equal */
+ Assert(false);
+
+ return 0;
+}
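
The nbts_prep_ctx(indexRel) calls added to tuplesort_begin_cluster() and
tuplesort_begin_index_btree() above are what make these comparators reachable:
once the context variable exists, a reference to comparetup_index_btree
resolves through NBTS_SPECIALIZE_NAME (defined in nbtree_spec.h further below)
into a runtime choice between the generated variants, conceptually:

    /* conceptual expansion, assuming only the CACHED and DEFAULT shapes of this patch */
    base->comparetup = (__nbts_ctx == NBTS_CTX_CACHED)
        ? comparetup_index_btree_cached
        : comparetup_index_btree_default;

Templated code itself never pays that branch; inside a _spec.c file
NBTS_FUNCTION() names the same-shape variant directly.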
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 0579120693..ce6999d4b5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1121,15 +1121,27 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+typedef enum NBTS_CTX {
+ NBTS_CTX_CACHED,
+ NBTS_CTX_DEFAULT, /* fallback */
+} NBTS_CTX;
+
+static inline NBTS_CTX _nbt_spec_context(Relation irel)
+{
+ if (!PointerIsValid(irel))
+ return NBTS_CTX_DEFAULT;
+
+ return NBTS_CTX_CACHED;
+}
+
+
+#define NBT_SPECIALIZE_FILE "access/nbtree_specfuncs.h"
+#include "nbtree_spec.h"
+
/*
* external entry points for btree, in nbtree.c
*/
extern void btbuildempty(Relation index);
-extern bool btinsert(Relation rel, Datum *values, bool *isnull,
- ItemPointer ht_ctid, Relation heapRel,
- IndexUniqueCheck checkUnique,
- bool indexUnchanged,
- struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
@@ -1160,8 +1172,6 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
*/
-extern void _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem,
- Size newitemsz, bool bottomupdedup);
extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
Size newitemsz);
extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -1177,9 +1187,6 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
/*
* prototypes for functions in nbtinsert.c
*/
-extern bool _bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, bool indexUnchanged,
- Relation heapRel);
extern void _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf,
BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack,
@@ -1230,16 +1237,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
- Buffer *bufP, int access);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
- Buffer buf, bool forupdate, BTStack stack,
- int access, AttrNumber *comparecol,
- char *tupdatabuf);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
- AttrNumber highcmpcol);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
- OffsetNumber offnum, AttrNumber *comparecol);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
@@ -1247,7 +1244,6 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1255,8 +1251,6 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1269,10 +1263,6 @@ extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
extern char *btbuildphasename(int64 phasenum);
-extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
-extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
new file mode 100644
index 0000000000..fa38b09c6e
--- /dev/null
+++ b/src/include/access/nbtree_spec.h
@@ -0,0 +1,183 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtree_spec.h
+ *	  key-shape specialization support for the postgres btree access method.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nbtree_spec.h
+ *
+ *-------------------------------------------------------------------------
+ *
+ * Specialize key-accessing functions and the hot code around those.
+ *
+ * Key attribute iteration is specialized through the use of the following
+ * macros:
+ *
+ * - nbts_attiterdeclare(itup)
+ * Declare the variables required to iterate over the provided IndexTuple's
+ * key attributes. Many tuples may have their attributes iterated over at the
+ * same time.
+ * - nbts_attiterinit(itup, initAttNum, tupDesc)
+ * Initialize the attribute iterator for the provided IndexTuple at
+ * the provided AttributeNumber.
+ * - nbts_foreachattr(initAttNum, endAttNum)
+ * Start a loop over the attributes, starting at initAttNum and ending at
+ * endAttNum, inclusive. It also takes care of truncated attributes.
+ * - nbts_attiter_attnum
+ * The current attribute number
+ * - nbts_attiter_nextattdatum(itup, tupDesc)
+ * Updates the attribute iterator state to the next attribute. Returns the
+ * datum of the next attribute, which might be null (see below)
+ * - nbts_attiter_curattisnull(itup)
+ * Returns whether the result from the last nbts_attiter_nextattdatum is
+ * null.
+ * - nbts_prep_ctx(irel)
+ * Constructs the context that is used to dispatch calls to specialized
+ * functions. Note that this is not needed in code that is only reachable
+ * through the specialization machinery (i.e. code included through
+ * nbtree_spec.h), because such code always calls the specialized functions
+ * directly.
+ */
+
+/*
+ * Macros used in the nbtree specialization code.
+ */
+#define NBTS_TYPE_CACHED cached
+#define NBTS_TYPE_DEFAULT default
+#define NBTS_CTX_NAME __nbts_ctx
+
+/* contextual specializations */
+#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
+#define NBTS_SPECIALIZE_NAME(name) ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
+)
+
+/* how do we make names? */
+#define NBTS_MAKE_PREFIX(a) CppConcat(a,_)
+#define NBTS_MAKE_NAME_(a,b) CppConcat(a,b)
+#define NBTS_MAKE_NAME(a,b) NBTS_MAKE_NAME_(NBTS_MAKE_PREFIX(a),b)
+
+#define nbt_opt_specialize(rel) \
+do { \
+ Assert(PointerIsValid(rel)); \
+ if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
+} while (false)
+
+/*
+ * Protections against multiple inclusions - the definition of this macro is
+ * different for files included with the templating mechanism vs the users
+ * of this template, so redefine these macros at top and bottom.
+ */
+#ifdef NBTS_FUNCTION
+#undef NBTS_FUNCTION
+#endif
+#define NBTS_FUNCTION(name) NBTS_MAKE_NAME(name, NBTS_TYPE)
+
+/* While specializing, the context is the local context */
+#ifdef nbts_prep_ctx
+#undef nbts_prep_ctx
+#endif
+#define nbts_prep_ctx(rel)
+
+/*
+ * Specialization 1: CACHED
+ *
+ * Multiple key columns, optimized access for attcacheoff-cacheable offsets.
+ */
+#define NBTS_SPECIALIZING_CACHED
+#define NBTS_TYPE NBTS_TYPE_CACHED
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) do {} while (false)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_SPECIALIZING_CACHED
+#undef NBTS_TYPE
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * Specialization 2: DEFAULT
+ *
+ * "Default", externally accessible, not so optimized functions
+ */
+
+/*
+ * Only the default specialization may itself need to dispatch to other
+ * variants, so here nbts_prep_ctx() actually creates a context.
+ */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
+
+#define NBTS_SPECIALIZING_DEFAULT
+#define NBTS_TYPE NBTS_TYPE_DEFAULT
+
+#define nbts_attiterdeclare(itup) \
+ bool NBTS_MAKE_NAME(itup, isNull)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc)
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_getattr((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, isNull)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, isNull)
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_DEFAULT
+
+/* un-define the optimization macros */
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
+/*
+ * All subsequent uses of nbts_prep_ctx are in non-templated code, so from
+ * here on it must actually create the context.
+ */
+#undef nbts_prep_ctx
+#define nbts_prep_ctx(rel) NBTS_MAKE_CTX(rel)
+
+/*
+ * From here on, all NBTS_FUNCTION names refer to specialized functions being
+ * called from non-templated code. Change the result of the macro from a
+ * direct call into a conditional call to the variant that matches the
+ * current context.
+ */
+#undef NBTS_FUNCTION
+#define NBTS_FUNCTION(name) NBTS_SPECIALIZE_NAME(name)
+
+#undef NBT_SPECIALIZE_FILE
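
To illustrate the iteration API documented at the top of this header, this is
roughly how a hot loop such as _bt_keep_natts_fast() can be expressed with
these macros (an illustrative sketch, not an excerpt from the series):

    /* sketch: a keep-natts-style loop inside a templated _spec.c file */
    static int
    spec_keep_natts_sketch(Relation rel, IndexTuple lastleft, IndexTuple firstright)
    {
        TupleDesc   itupdesc = RelationGetDescr(rel);
        int         keysz = IndexRelationGetNumberOfKeyAttributes(rel);
        int         keepnatts = 1;
        nbts_attiterdeclare(lastleft);
        nbts_attiterdeclare(firstright);

        nbts_attiterinit(lastleft, 1, itupdesc);
        nbts_attiterinit(firstright, 1, itupdesc);
        nbts_foreachattr(1, keysz)
        {
            Datum       datum1 = nbts_attiter_nextattdatum(lastleft, itupdesc);
            Datum       datum2 = nbts_attiter_nextattdatum(firstright, itupdesc);
            Form_pg_attribute att = TupleDescAttr(itupdesc, nbts_attiter_attnum - 1);

            if (nbts_attiter_curattisnull(lastleft) != nbts_attiter_curattisnull(firstright))
                break;
            if (!nbts_attiter_curattisnull(lastleft) &&
                !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
                break;
            keepnatts++;
        }
        return keepnatts;
    }

For the CACHED and DEFAULT shapes the macros expand to the plain
index_getattr() loop seen in nbtutils_spec.c; the point is that other shapes
(such as the uncached iterator in the 0006 patch below) can substitute a
cheaper per-attribute access without touching the loop body.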
diff --git a/src/include/access/nbtree_specfuncs.h b/src/include/access/nbtree_specfuncs.h
new file mode 100644
index 0000000000..cf27d406ae
--- /dev/null
+++ b/src/include/access/nbtree_specfuncs.h
@@ -0,0 +1,65 @@
+/*
+ * prototypes for functions that are included in nbtree.h
+ */
+
+#define _bt_specialize NBTS_FUNCTION(_bt_specialize)
+#define btinsert NBTS_FUNCTION(btinsert)
+#define _bt_dedup_pass NBTS_FUNCTION(_bt_dedup_pass)
+#define _bt_doinsert NBTS_FUNCTION(_bt_doinsert)
+#define _bt_search NBTS_FUNCTION(_bt_search)
+#define _bt_moveright NBTS_FUNCTION(_bt_moveright)
+#define _bt_binsrch_insert NBTS_FUNCTION(_bt_binsrch_insert)
+#define _bt_compare NBTS_FUNCTION(_bt_compare)
+#define _bt_mkscankey NBTS_FUNCTION(_bt_mkscankey)
+#define _bt_checkkeys NBTS_FUNCTION(_bt_checkkeys)
+#define _bt_truncate NBTS_FUNCTION(_bt_truncate)
+#define _bt_keep_natts_fast NBTS_FUNCTION(_bt_keep_natts_fast)
+
+/*
+ * prototypes for functions in nbtree_spec.h
+ */
+extern void _bt_specialize(Relation rel);
+
+extern bool btinsert(Relation rel, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+
+/*
+ * prototypes for functions in nbtdedup_spec.h
+ */
+extern void _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem,
+ Size newitemsz, bool bottomupdedup);
+
+
+/*
+ * prototypes for functions in nbtinsert_spec.h
+ */
+
+extern bool _bt_doinsert(Relation rel, IndexTuple itup,
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel);
+
+/*
+ * prototypes for functions in nbtsearch_spec.h
+ */
+extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
+ Buffer *bufP, int access);
+extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+ Buffer buf, bool forupdate, BTStack stack,
+ int access, AttrNumber *comparecol,
+ char *tupdatabuf);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+ AttrNumber highcmpcol);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, AttrNumber *comparecol);
+/*
+ * prototypes for functions in nbtutils_spec.h
+ */
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
+extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright);
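
These prototypes replace the ones removed from nbtree.h above; each of the
functions now exists once per key shape. External callers keep spelling the
unsuffixed name, they just need a context in scope first, e.g. (sketch):

    nbts_prep_ctx(rel);
    key = _bt_mkscankey(rel, itup);     /* dispatches to _bt_mkscankey_cached
                                         * or _bt_mkscankey_default */

Code that is itself compiled through the template skips the dispatch and
calls the same-shape variant directly.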
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index 4e09c4686b..e504a2f114 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -116,6 +116,8 @@ do
test "$f" = src/pl/tcl/pltclerrcodes.h && continue
# Also not meant to be included standalone.
+ test "$f" = src/include/access/nbtree_spec.h && continue
+ test "$f" = src/include/access/nbtree_specfuncs.h && continue
test "$f" = src/include/common/unicode_nonspacing_table.h && continue
test "$f" = src/include/common/unicode_east_asian_fw_table.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 8dee1b5670..101888c806 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -111,6 +111,8 @@ do
test "$f" = src/pl/tcl/pltclerrcodes.h && continue
# Also not meant to be included standalone.
+ test "$f" = src/include/access/nbtree_spec.h && continue
+ test "$f" = src/include/access/nbtree_specfuncs.h && continue
test "$f" = src/include/common/unicode_nonspacing_table.h && continue
test "$f" = src/include/common/unicode_east_asian_fw_table.h && continue
--
2.40.1
v13-0006-btree-specialization-for-variable-length-multi-a.patch (application/octet-stream)
From 2970387e88cf40542ee507ff94bcc93d47961946 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 13 Jan 2023 15:42:41 +0100
Subject: [PATCH v13 6/6] btree specialization for variable-length
multi-attribute keys
The default code path's attribute accesses (index_getattr) are O(n^2) when
attribute offsets cannot be cached, so for multi-attribute keys we accept an
increased startup cost in exchange for cheaper access to later attributes.
Note that this specialization is only used for indexes that have at least one
variable-length key attribute (except, in specific cases, when that is the
last key attribute).
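To spell out the complexity claim: without a usable attcacheoff, computing the
offset of attribute k requires walking attributes 1 .. k-1, so deforming all n
key attributes costs 1 + 2 + ... + n = O(n^2) work. Illustration (placeholder
variable names, not patch code):

    /* O(n^2): each index_getattr() re-derives the offset from the start */
    for (int attnum = 1; attnum <= natts; attnum++)
        values[attnum - 1] = index_getattr(itup, attnum, tupdesc, &isnull[attnum - 1]);

The iterator introduced in itup_attiter.h below instead carries the running
offset from one attribute to the next, so the same pass is O(n).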
---
src/backend/access/nbtree/README | 6 +-
src/backend/access/nbtree/nbtree_spec.c | 3 +
src/include/access/itup_attiter.h | 199 ++++++++++++++++++++++++
src/include/access/nbtree.h | 11 +-
src/include/access/nbtree_spec.h | 48 +++++-
5 files changed, 258 insertions(+), 9 deletions(-)
create mode 100644 src/include/access/itup_attiter.h
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index e90e24cb70..0c45288e61 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1105,14 +1105,12 @@ performance of those hot paths.
Optimized code paths exist for the following cases, in order of preference:
- indexes with only a single key attribute
+ - multi-column indexes that cannot pre-calculate the offsets of all key
+ attributes in the tuple data section
- multi-column indexes that could benefit from the attcacheoff optimization
NB: This is also the default path, and is comparatively slow for uncachable
attribute offsets.
-Future work will optimize for multi-column indexes that don't benefit
-from the attcacheoff optimization by improving on the O(n^2) nature of
-index_getattr through storing attribute offsets.
-
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtree_spec.c b/src/backend/access/nbtree/nbtree_spec.c
index 21635397ed..699197dfa7 100644
--- a/src/backend/access/nbtree/nbtree_spec.c
+++ b/src/backend/access/nbtree/nbtree_spec.c
@@ -33,6 +33,9 @@ _bt_specialize(Relation rel)
case NBTS_CTX_CACHED:
_bt_specialize_cached(rel);
break;
+ case NBTS_CTX_UNCACHED:
+ _bt_specialize_uncached(rel);
+ break;
case NBTS_CTX_SINGLE_KEYATT:
_bt_specialize_single_keyatt(rel);
break;
diff --git a/src/include/access/itup_attiter.h b/src/include/access/itup_attiter.h
new file mode 100644
index 0000000000..c8fb6954bc
--- /dev/null
+++ b/src/include/access/itup_attiter.h
@@ -0,0 +1,199 @@
+/*-------------------------------------------------------------------------
+ *
+ * itup_attiter.h
+ * POSTGRES index tuple attribute iterator definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/itup_attiter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ITUP_ATTITER_H
+#define ITUP_ATTITER_H
+
+#include "access/itup.h"
+#include "varatt.h"
+
+typedef struct IAttrIterStateData
+{
+ int offset;
+ bool slow;
+ bool isNull;
+} IAttrIterStateData;
+
+typedef IAttrIterStateData * IAttrIterState;
+
+/* ----------------
+ * index_attiterinit
+ *
+ * This gets called many times, so we macro the cacheable and NULL
+ * lookups, and call nocache_index_attiterinit() for the rest.
+ *
+ * tup - the tuple being iterated on
+ * attnum - the attribute number that we start the iteration with
+ * in the first index_attiternext call
+ * tupdesc - the tuple description
+ *
+ * ----------------
+ */
+#define index_attiterinit(tup, attnum, tupleDesc, iter) \
+do { \
+ if ((attnum) == 1) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ 0 /* Offset of attribute 1 is always 0 */, \
+ false /* slow */, \
+ false /* isNull */ \
+ }); \
+ } \
+ else if (!IndexTupleHasNulls(tup) && \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff >= 0) \
+ { \
+ *(iter) = ((IAttrIterStateData) { \
+ TupleDescAttr((tupleDesc), (attnum)-1)->attcacheoff, /* offset */ \
+ false, /* slow */ \
+ false /* isNull */ \
+ }); \
+ } \
+ else \
+ nocache_index_attiterinit((tup), (attnum) - 1, (tupleDesc), (iter)); \
+} while (false)
+
+/*
+ * Initiate an index attribute iterator to attribute attnum,
+ * and return the corresponding datum.
+ *
+ * This is nearly the same as index_deform_tuple, except that this
+ * returns the internal state up to attnum, instead of populating the
+ * datum- and isnull-arrays
+ */
+static inline void
+nocache_index_attiterinit(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ int curatt;
+ char *tp; /* ptr to tuple data */
+ int off; /* offset in tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ bool null = false;
+
+ /* Assert to protect callers */
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ /* XXX "knows" t_bits are just after fixed tuple header! */
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+ off = 0;
+
+ for (curatt = 0; curatt < attnum; curatt++)
+ {
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, curatt);
+
+ if (hasnulls && att_isnull(curatt, bp))
+ {
+ null = true;
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ null = false;
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+ }
+
+ iter->isNull = null;
+ iter->offset = off;
+ iter->slow = slow;
+}
+
+/* ----------------
+ * index_attiternext() - get the next attribute of an index tuple
+ *
+ * This gets called many times, so we do the least amount of work
+ * possible.
+ *
+ * The code does not attempt to update attcacheoff, as it is unlikely
+ * to reach a situation where the cached offset matters a lot.
+ * If the cached offsets do matter, the caller should make sure that
+ * PopulateTupleDescCacheOffsets() was called on the tuple descriptor
+ * to populate the attribute offset cache.
+ *
+ * ----------------
+ */
+static inline Datum
+index_attiternext(IndexTuple tup, AttrNumber attnum, TupleDesc tupleDesc, IAttrIterState iter)
+{
+ bool hasnulls = IndexTupleHasNulls(tup);
+ char *tp; /* ptr to tuple data */
+ bits8 *bp; /* ptr to null bitmap in tuple */
+ Datum datum;
+ Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum - 1);
+
+ Assert(PointerIsValid(iter));
+ Assert(tupleDesc->natts <= INDEX_MAX_KEYS);
+ Assert(attnum <= tupleDesc->natts);
+ Assert(attnum > 0);
+
+ bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
+
+ tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+
+ if (hasnulls && att_isnull(attnum - 1, bp))
+ {
+ iter->isNull = true;
+ iter->slow = true;
+ return (Datum) 0;
+ }
+
+ iter->isNull = false;
+
+ if (!iter->slow && thisatt->attcacheoff >= 0)
+ iter->offset = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ iter->offset = att_align_pointer(iter->offset, thisatt->attalign, -1,
+ tp + iter->offset);
+ iter->slow = true;
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ iter->offset = att_align_nominal(iter->offset, thisatt->attalign);
+ }
+
+ datum = fetchatt(thisatt, tp + iter->offset);
+
+ iter->offset = att_addlength_pointer(iter->offset, thisatt->attlen, tp + iter->offset);
+
+ if (thisatt->attlen <= 0)
+ iter->slow = true; /* can't use attcacheoff anymore */
+
+ return datum;
+}
+
+#endif /* ITUP_ATTITER_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 89d1c7ab01..5d2fb4b5db 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -16,6 +16,7 @@
#include "access/amapi.h"
#include "access/itup.h"
+#include "access/itup_attiter.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/xlogreader.h"
@@ -1123,18 +1124,26 @@ typedef struct BTOptions
typedef enum NBTS_CTX {
NBTS_CTX_SINGLE_KEYATT,
+ NBTS_CTX_UNCACHED,
NBTS_CTX_CACHED,
NBTS_CTX_DEFAULT, /* fallback */
} NBTS_CTX;
static inline NBTS_CTX _nbt_spec_context(Relation irel)
{
+ AttrNumber nKeyAtts;
+
if (!PointerIsValid(irel))
return NBTS_CTX_DEFAULT;
- if (IndexRelationGetNumberOfKeyAttributes(irel) == 1)
+ nKeyAtts = IndexRelationGetNumberOfKeyAttributes(irel);
+
+ if (nKeyAtts == 1)
return NBTS_CTX_SINGLE_KEYATT;
+ if (TupleDescAttr(irel->rd_att, nKeyAtts - 1)->attcacheoff < -1)
+ return NBTS_CTX_UNCACHED;
+
return NBTS_CTX_CACHED;
}
diff --git a/src/include/access/nbtree_spec.h b/src/include/access/nbtree_spec.h
index 8e476c300d..efed9824e7 100644
--- a/src/include/access/nbtree_spec.h
+++ b/src/include/access/nbtree_spec.h
@@ -45,6 +45,7 @@
* Macros used in the nbtree specialization code.
*/
#define NBTS_TYPE_SINGLE_KEYATT single_keyatt
+#define NBTS_TYPE_UNCACHED uncached
#define NBTS_TYPE_CACHED cached
#define NBTS_TYPE_DEFAULT default
#define NBTS_CTX_NAME __nbts_ctx
@@ -53,8 +54,10 @@
#define NBTS_MAKE_CTX(rel) const NBTS_CTX NBTS_CTX_NAME = _nbt_spec_context(rel)
#define NBTS_SPECIALIZE_NAME(name) ( \
(NBTS_CTX_NAME) == NBTS_CTX_SINGLE_KEYATT ? (NBTS_MAKE_NAME(name, NBTS_TYPE_SINGLE_KEYATT)) : ( \
- (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
- NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ (NBTS_CTX_NAME) == NBTS_CTX_UNCACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_UNCACHED)) : ( \
+ (NBTS_CTX_NAME) == NBTS_CTX_CACHED ? (NBTS_MAKE_NAME(name, NBTS_TYPE_CACHED)) : ( \
+ NBTS_MAKE_NAME(name, NBTS_TYPE_DEFAULT) \
+ ) \
) \
) \
)
@@ -69,8 +72,11 @@ do { \
Assert(PointerIsValid(rel)); \
if (unlikely((rel)->rd_indam->aminsert == btinsert_default)) \
{ \
- nbts_prep_ctx(rel); \
- _bt_specialize(rel); \
+ PopulateTupleDescCacheOffsets((rel)->rd_att); \
+ { \
+ nbts_prep_ctx(rel); \
+ _bt_specialize(rel); \
+ } \
} \
} while (false)
@@ -216,6 +222,40 @@ do { \
#undef nbts_attiter_nextattdatum
#undef nbts_attiter_curattisnull
+/*
+ * Multiple key columns, but attcacheoff -optimization doesn't apply.
+ */
+#define NBTS_SPECIALIZING_UNCACHED
+#define NBTS_TYPE NBTS_TYPE_UNCACHED
+
+#define nbts_attiterdeclare(itup) \
+ IAttrIterStateData NBTS_MAKE_NAME(itup, iter)
+
+#define nbts_attiterinit(itup, initAttNum, tupDesc) \
+ index_attiterinit((itup), (initAttNum), (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_foreachattr(initAttNum, endAttNum) \
+ for (int spec_i = (initAttNum); spec_i <= (endAttNum); spec_i++)
+
+#define nbts_attiter_attnum spec_i
+
+#define nbts_attiter_nextattdatum(itup, tupDesc) \
+ index_attiternext((itup), spec_i, (tupDesc), &(NBTS_MAKE_NAME(itup, iter)))
+
+#define nbts_attiter_curattisnull(itup) \
+ NBTS_MAKE_NAME(itup, iter).isNull
+
+#include NBT_SPECIALIZE_FILE
+
+#undef NBTS_TYPE
+#undef NBTS_SPECIALIZING_UNCACHED
+#undef nbts_attiterdeclare
+#undef nbts_attiterinit
+#undef nbts_foreachattr
+#undef nbts_attiter_attnum
+#undef nbts_attiter_nextattdatum
+#undef nbts_attiter_curattisnull
+
/*
* All next uses of nbts_prep_ctx are in non-templated code, so here we make
* sure we actually create the context.
--
2.40.1
On Mon, Sep 18, 2023 at 8:57 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Rebased again to v13 to account for API changes in 9f060253 "Remove
some more "snapshot too old" vestiges."... and now attached.
I see that this revised version is approximately as invasive as
earlier versions were - it has specializations for almost everything.
Do you really need to specialize basically all of nbtdedup.c, for
example?
In order for this patch series to have any hope of getting committed,
there needs to be significant work on limiting the amount of code
churn, and resulting object code size. There are various problems that
come from wholesale specializing all of this code. There are
distributed costs -- costs that won't necessarily be in evidence from
microbenchmarks.
It might be worth familiarizing yourself with bloaty, a tool for
profiling the size of binaries:
https://github.com/google/bloaty
Is it actually sensible to tie dynamic prefix compression to
everything else here? Apparently there is a regression for certain
cases caused by that patch (the first one), which necessitates making
up the difference in later patches. But...isn't it also conceivable
that some completely different optimization could do that for us
instead? Why is there a regression from v13-0001-*? Can we just fix
the regression directly? And if not, why not?
I also have significant doubts about your scheme for avoiding
invalidating the bounds of the page based on its high key matching the
parent's separator. The subtle dynamic prefix compression race
condition that I was worried about was one caused by page deletion.
But page deletion doesn't change the high key at all (it does that for
the deleted page, but that's hardly relevant). So how could checking
the high key possibly help?
Page deletion will make the pivot tuple in the parent page whose
downlink originally pointed to the concurrently deleted page change,
so that it points to the deleted page's original right sibling page
(the sibling being the page that you need to worry about). This means
that the lower bound for the not-deleted right sibling page has
changed from under us. And that we lack any simple way of detecting
that it might have happened.
The race that I'm worried about is extremely narrow, because it
involves a page deletion and a concurrent insert into the key space
that was originally covered by the deleted page. It's extremely
unlikely to happen in the real world, but it's still a bug.
It's possible that it'd make sense to do a memcmp() of the high key
using a copy of a separator from the parent page. That at least seems
like it could be made safe. But I don't see what it has to do with
dynamic prefix compression. In any case there is a simpler way to
avoid the high key check for internal pages: do the _bt_binsrch first,
and only consider _bt_moveright when the answer that _bt_binsrch gave
suggests that we might actually need to call _bt_moveright.
--
Peter Geoghegan
On Mon, Sep 18, 2023 at 6:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
I also have significant doubts about your scheme for avoiding
invalidating the bounds of the page based on its high key matching the
parent's separator. The subtle dynamic prefix compression race
condition that I was worried about was one caused by page deletion.
But page deletion doesn't change the high key at all (it does that for
the deleted page, but that's hardly relevant). So how could checking
the high key possibly help?
To be clear, page deletion does what I described here (it does an
in-place update of the downlink to the deleted page, so the same pivot
tuple now points to its right sibling, which is our page of concern),
in addition to fully removing the original pivot tuple whose downlink
originally pointed to our page of concern. This is why page deletion
makes the key space "move to the right", very much like a page split
would.
IMV it would be better if it made the key space "move to the left"
instead, which would make page deletion close to the exact opposite of
a page split -- that's what the Lanin & Shasha paper does (sort of).
If you have this symmetry, then things like dynamic prefix compression
are a lot simpler.
ISTM that the only way that a scheme like yours could work, assuming
that making page deletion closer to Lanin & Shasha is not going to
happen, is something even more invasive than that: it might work if
you had a page low key (not just a high key) on every page. You'd have
to compare the lower bound separator key from the parent (which might
itself be the page-level low key for the parent) to the page low key.
That's not a serious suggestion; I'm just pointing out that you need
to be able to compare like with like for a canary condition like this
one, and AFAICT there is no lightweight practical way of doing that
that is 100% robust.
--
Peter Geoghegan
On Tue, 19 Sept 2023 at 03:56, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Sep 18, 2023 at 6:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
I also have significant doubts about your scheme for avoiding
invalidating the bounds of the page based on its high key matching the
parent's separator. The subtle dynamic prefix compression race
condition that I was worried about was one caused by page deletion.
But page deletion doesn't change the high key at all (it does that for
the deleted page, but that's hardly relevant). So how could checking
the high key possibly help?
To be clear, page deletion does what I described here (it does an
in-place update of the downlink to the deleted page, so the same pivot
tuple now points to its right sibling, which is our page of concern),
in addition to fully removing the original pivot tuple whose downlink
originally pointed to our page of concern. This is why page deletion
makes the key space "move to the right", very much like a page split
would.
I am still aware of this issue, and I think we've discussed it in
detail earlier. I think it does not really impact this patchset. Sure,
I can't use dynamic prefix compression to its full potential, but I
still do get serious performance benefits:
FULL KEY _bt_compare calls:
'Optimal' full-tree DPT: average O(3)
Paged DPT (this patch): average O(2 * height)
... without HK opt: average O(3 * height)
Current: O(log2(n))
Single-attribute compares:
'Optimal' full-tree DPT: O(log(N))
Paged DPT (this patch): O(log(N))
Current: 0 (or, O(log(N) * natts))
So, in effect, this patch reduces the work done on each page to only
one or two full key compare operations (on average).
I use "on average": on a sorted array with values ranging from
potentially minus infinity to positive infinity, it takes on average 3
compares before a binary search can determine the bounds of the
keyspace it has still to search. If one side's bound is already
known, it takes on average 2 compare operations before these bounds
are known.
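Plugging in illustrative numbers (an index of roughly 1M tuples and a
tree height of 3; these are assumptions for the example, not
measurements):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  ntuples = 1e6;      /* total index tuples; example value */
    int     height = 3;         /* levels descended; example value */

    /* expected full-key _bt_compare calls per index descent */
    printf("current (master):            ~%.0f\n", log2(ntuples));
    printf("paged DPT, with highkey opt: ~%d\n", 2 * height);
    printf("paged DPT, without:          ~%d\n", 3 * height);
    return 0;
}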
IMV it would be better if it made the key space "move to the left"
instead, which would make page deletion close to the exact opposite of
a page split -- that's what the Lanin & Shasha paper does (sort of).
If you have this symmetry, then things like dynamic prefix compression
are a lot simpler.
ISTM that the only way that a scheme like yours could work, assuming
that making page deletion closer to Lanin & Shasha is not going to
happen, is something even more invasive than that: it might work if
you had a page low key (not just a high key) on every page.
Note that the "dynamic prefix compression" is currently only active on
the page level.
True, the patch does carry over _bt_compare's prefix result for the
high key on the child page, but we do that only if the highkey is
actually an exact copy of the right separator on the parent page. This
carry-over opportunity is extremely likely to happen, because the high
key generated in _bt_split is then later inserted on the parent page.
The only case where it could differ is in concurrent page deletions.
That is thus a case of betting a few cycles to commonly save many
cycles (a memcmp vs. a full-key _bt_compare).
Again, we do not actually skip a prefix on the compare call of the
P_HIGHKEY tuple, nor for the compares of the midpoints unless we've
found a tuple on the page that compares as smaller than the search
key.
You'd have
to compare the lower bound separator key from the parent (which might
itself be the page-level low key for the parent) to the page low key.
That's not a serious suggestion; I'm just pointing out that you need
to be able to compare like with like for a canary condition like this
one, and AFAICT there is no lightweight practical way of doing that
that is 100% robust.
True, if we had consistent LOWKEYs on pages, that'd make this job much
easier: the prefix could indeed be carried over in full. But that's
not currently the case for the nbtree code, and this is the next best
thing, as it also has the benefit of working with all currently
supported physical formats of btree indexes.
Kind regards,
Matthias van de Meent
On Tue, Sep 19, 2023 at 6:28 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
To be clear, page deletion does what I described here (it does an
in-place update of the downlink to the deleted page, so the same pivot
tuple now points to its right sibling, which is our page of concern),
in addition to fully removing the original pivot tuple whose downlink
originally pointed to our page of concern. This is why page deletion
makes the key space "move to the right", very much like a page split
would.
I am still aware of this issue, and I think we've discussed it in
detail earlier. I think it does not really impact this patchset. Sure,
I can't use dynamic prefix compression to its full potential, but I
still do get serious performance benefits:
Then why have you linked whatever the first patch does with the high
key to dynamic prefix compression in the first place? Your commit
message makes it sound like it's a way to get around the race
condition that affects dynamic prefix compression, but as far as I can
tell it has nothing whatsoever to do with that race condition.
Questions:
1. Why shouldn't the high key thing be treated as an unrelated piece of work?
I guess it's possible that it really should be structured that way,
but even then it's your responsibility to make it clear why that is.
As things stand, this presentation is very confusing.
2. Separately, why should dynamic prefix compression be tied to the
specialization work? I also see no principled reason why it should be
tied to the other two things.
I didn't mind this sort of structure so much back when this work was
very clearly exploratory -- I've certainly structured work in this
area that way myself, in the past. But if you want this patch set to
ever go beyond being an exploratory patch set, something has to
change. I don't have time to do a comprehensive (or even a fairly
cursory) analysis of which parts of the patch are helping, and which
are marginal or even add no value.
You'd have
to compare the lower bound separator key from the parent (which might
itself be the page-level low key for the parent) to the page low key.
That's not a serious suggestion; I'm just pointing out that you need
to be able to compare like with like for a canary condition like this
one, and AFAICT there is no lightweight practical way of doing that
that is 100% robust.
True, if we had consistent LOWKEYs on pages, that'd make this job much
easier: the prefix could indeed be carried over in full. But that's
not currently the case for the nbtree code, and this is the next best
thing, as it also has the benefit of working with all currently
supported physical formats of btree indexes.
I went over the low key thing again because I had to struggle to
understand what your high key optimization had to do with dynamic
prefix compression. I'm still struggling. I think that your commit
message very much led me astray. Quoting it here:
"""
Although this limits the overall applicability of the
performance improvement, it still allows for a nice performance
improvement in most cases where initial columns have many
duplicate values and a compare function that is not cheap.
As an exception to the above rule, most of the time a page's
highkey is equal to the right separator on the parent page due to
how btree splits are done. By storing this right separator from
the parent page and then validating that the highkey of the child
page contains the exact same data, we can restore the right prefix
bound without having to call the relatively expensive _bt_compare.
"""
You're directly tying the high key optimization to the dynamic prefix
compression optimization. But why?
I have long understood that you gave up on the idea of keeping the
bounds across levels of the tree (which does make sense to me), but
yesterday the issue became totally muddled by this high key business.
That's why I rehashed the earlier discussion, which I had previously
understood to be settled.
--
Peter Geoghegan
On Tue, 19 Sept 2023 at 22:49, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Sep 19, 2023 at 6:28 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
To be clear, page deletion does what I described here (it does an
in-place update of the downlink to the deleted page, so the same pivot
tuple now points to its right sibling, which is our page of concern),
in addition to fully removing the original pivot tuple whose downlink
originally pointed to our page of concern. This is why page deletion
makes the key space "move to the right", very much like a page split
would.
I am still aware of this issue, and I think we've discussed it in
detail earlier. I think it does not really impact this patchset. Sure,
I can't use dynamic prefix compression to its full potential, but I
still do get serious performance benefits:
Then why have you linked whatever the first patch does with the high
key to dynamic prefix compression in the first place? Your commit
message makes it sound like it's a way to get around the race
condition that affects dynamic prefix compression, but as far as I can
tell it has nothing whatsoever to do with that race condition.
We wouldn't have to store the downlink's right separator and compare
it to the highkey if we didn't deviate from L&Y's algorithm for DELETE
operations (which causes the race condition): just the right sibling's
block number would be enough.
(Yes, the right sibling's block number isn't available for the
rightmost downlink of a page. In those cases, we'd have to compare the
parent page's high key with that of the downlink page, but I suppose
that'll be relatively rare).
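For what it's worth, the check itself stays cheap; something along
these lines (a simplified sketch with a made-up function name, not the
exact code in 0002):

#include "postgres.h"
#include "access/nbtree.h"

/*
 * After following a downlink, test whether the child's high key holds the
 * same key data as the right separator we saw on the parent.  If so, the
 * parent-level upper bound (and any prefix established against it) still
 * applies to this child page, and no _bt_compare call is needed for it.
 */
static bool
high_key_matches_parent_separator(Relation rel, Page childpage,
                                  IndexTuple parent_rightsep)
{
    BTPageOpaque opaque = BTPageGetOpaque(childpage);
    IndexTuple  highkey;
    Size        hk_off,
                sep_off;

    if (P_RIGHTMOST(opaque))
        return false;           /* rightmost page has no high key */

    highkey = (IndexTuple) PageGetItem(childpage,
                                       PageGetItemId(childpage, P_HIKEY));

    /* both tuples must describe the same number of (truncated) key columns */
    if (BTreeTupleGetNAtts(highkey, rel) !=
        BTreeTupleGetNAtts(parent_rightsep, rel))
        return false;

    hk_off = IndexInfoFindDataOffset(highkey->t_info);
    sep_off = IndexInfoFindDataOffset(parent_rightsep->t_info);

    return IndexTupleSize(highkey) - hk_off ==
           IndexTupleSize(parent_rightsep) - sep_off &&
        memcmp((char *) highkey + hk_off,
               (char *) parent_rightsep + sep_off,
               IndexTupleSize(highkey) - hk_off) == 0;
}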
Questions:
1. Why shouldn't the high key thing be treated as an unrelated piece of work?
Because it was only significant and relatively visible after getting
rid of the other full key compare operations, and it touches
essentially the same areas. Splitting them out in more patches would
be a hassle.
I guess it's possible that it really should be structured that way,
but even then it's your responsibility to make it clear why that is.
Sure. But I think I've made that clear upthread too.
As things stand, this presentation is very confusing.
I'll take a look at improving the presentation.
2. Separately, why should dynamic prefix compression be tied to the
specialization work? I also see no principled reason why it should be
tied to the other two things.
My performance results show that insert performance degrades by 2-3%
for single-column indexes if only the dynamic prefix truncation patch
is applied [0]. The specialization patches fix that regression on my
machine (5950x) due to having optimized code for the use case. I can't
say for certain that other machines will see the same results, but I
think results will at least be similar.
I didn't mind this sort of structure so much back when this work was
very clearly exploratory -- I've certainly structured work in this
area that way myself, in the past. But if you want this patch set to
ever go beyond being an exploratory patch set, something has to
change.
I think it's fairly complete, and mostly waiting for review.
I don't have time to do a comprehensive (or even a fairly
cursory) analysis of which parts of the patch are helping, and which
are marginal or even add no value.
It is a shame that you don't have the time to review this patch.
You'd have
to compare the lower bound separator key from the parent (which might
itself be the page-level low key for the parent) to the page low key.
That's not a serious suggestion; I'm just pointing out that you need
to be able to compare like with like for a canary condition like this
one, and AFAICT there is no lightweight practical way of doing that
that is 100% robust.
True, if we had consistent LOWKEYs on pages, that'd make this job much
easier: the prefix could indeed be carried over in full. But that's
not currently the case for the nbtree code, and this is the next best
thing, as it also has the benefit of working with all currently
supported physical formats of btree indexes.
I went over the low key thing again because I had to struggle to
understand what your high key optimization had to do with dynamic
prefix compression. I'm still struggling. I think that your commit
message very much led me astray. Quoting it here:
"""
Although this limits [...] relatively expensive _bt_compare.
"""
You're directly tying the high key optimization to the dynamic prefix
compression optimization. But why?
The value of skipping the _bt_compare call on the highkey is
relatively much higher in the prefix-skip case than it is on master,
as on master it's only one of the log(n) _bt_compare calls on the
page, while in the patch it's one of (on average) 3 full key
_bt_compare calls. This makes it much easier to prove the performance
gain, which made me integrate it into that patch instead of keeping it
separate.
I have long understood that you gave up on the idea of keeping the
bounds across levels of the tree (which does make sense to me), but
yesterday the issue became totally muddled by this high key business.
That's why I rehashed the earlier discussion, which I had previously
understood to be settled.
Understood. I'll see if I can improve the wording to something that is
more clear about what the optimization entails.
I'm planning to have these documentation changes included in the
next revision of the patchset, which will probably also reduce the
number of specialized functions (and with it the size of the binary).
It will take some extra time, because I would need to re-run the
performance suite, but the code changes should be very limited when
compared to the current patch (apart from moving code between .c and
_spec.c).
---
The meat of the changes is probably in 0001 (dynamic prefix skip),
0003 (change attribute iteration code to use specializable macros),
and 0006 (index attribute iteration for variable key offsets). 0002 is
mostly mechanical code movement, 0004 is a relatively easy
implementation of the iteration functionality for single-key-column
indexes, and 0005 adds an instrument for improving the efficiency of
attcacheoff by implementing negatively cached values ("cannot be
cached", instead of just "isn't cached") which are then used in 0006.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
[0]: /messages/by-id/CAEze2Wh_3+_Q+BefaLrpdXXR01vKr3R2R=h5gFxR+U4+0Z=40w@mail.gmail.com
On Mon, Sep 25, 2023 at 9:13 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I think it's fairly complete, and mostly waiting for review.
I don't have time to do a comprehensive (or even a fairly
cursory) analysis of which parts of the patch are helping, and which
are marginal or even add no value.
It is a shame that you don't have the time to review this patch.
I didn't say that. Just that I don't have the time (or more like the
inclination) to do most or all of the analysis that might allow us to
arrive at a committable patch. Most of the work with something like
this is the analysis of the trade-offs, not writing code. There are
all kinds of trade-offs that you could make with something like this,
and the prospect of doing that myself is kind of daunting. Ideally
you'd have made a significant start on that at this point.
I have long understood that you gave up on the idea of keeping the
bounds across levels of the tree (which does make sense to me), but
yesterday the issue became totally muddled by this high key business.
That's why I rehashed the earlier discussion, which I had previously
understood to be settled.
Understood. I'll see if I can improve the wording to something that is
more clear about what the optimization entails.
Cool.
--
Peter Geoghegan
On Thu, 26 Oct 2023 at 00:36, Peter Geoghegan <pg@bowt.ie> wrote:
Most of the work with something like
this is the analysis of the trade-offs, not writing code. There are
all kinds of trade-offs that you could make with something like this,
and the prospect of doing that myself is kind of daunting. Ideally
you'd have made a significant start on that at this point.
I believe I'd already made most trade-offs clear earlier in the
threads, along with rationales for the changes in behaviour. But here
goes again:
_bt_compare currently uses index_getattr() on each attribute of the
key. index_getattr() is O(n) for the n-th attribute if the index tuple
has any null or non-attcacheoff attributes in front of the current
one. Thus, _bt_compare does O(n^2) work in the number of key
attributes n, which typically costs several percent of performance but
is very bad in extreme cases, in addition to the O(n) calls to
opclass-supplied compare operations.
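To make that concrete, here's a minimal standalone sketch (not
PostgreSQL code; the attribute count is just an example) that counts
the pointer-walking work of an index_getattr-style lookup, which has to
re-walk all earlier attributes whenever attcacheoff can't be used,
against an iterator that keeps its running offset:

#include <stdio.h>

int
main(void)
{
    int     natts = 32;         /* number of key attributes; example value */
    long    getattr_steps = 0;  /* work done by repeated getattr-style calls */
    long    iter_steps = 0;     /* work done by an offset-keeping iterator */

    for (int attnum = 1; attnum <= natts; attnum++)
    {
        getattr_steps += attnum;    /* re-walks attributes 1..attnum */
        iter_steps += 1;            /* advances past a single attribute */
    }

    /* prints "getattr: 528 steps, iterator: 32 steps", i.e. O(n^2) vs O(n) */
    printf("getattr: %ld steps, iterator: %ld steps\n",
           getattr_steps, iter_steps);
    return 0;
}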
To solve most of the O(n) compare operations, we can optimize
_bt_compare to only compare "interesting" attributes, i.e. we can
apply "dynamic prefix truncation". This is implemented by patch 0001.
This is further enhanced with 0002, where we skip the compare
operations if the HIKEY is the same as the right separator of the
downlink we followed (due to our page split code, this case is
extremely likely).
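As a minimal sketch of the idea (over plain C strings rather than index
tuples; the patch's attribute-level skipping in _bt_compare works
analogously), a binary search can remember how many leading bytes of
the key already compared equal against its current lower and upper
bounds, and start each new comparison past the smaller of those two
prefixes:

#include <stdio.h>
#include <string.h>

/* compare key and tup starting at byte 'skip', report matched length */
static int
cmp_with_prefix(const char *key, const char *tup, int skip, int *eqlen)
{
    int     i = skip;

    while (key[i] != '\0' && key[i] == tup[i])
        i++;
    *eqlen = i;
    return (unsigned char) key[i] - (unsigned char) tup[i];
}

int
main(void)
{
    const char *tuples[] = {"abc0001", "abc0002", "abc0003", "abc0004"};
    const char *key = "abc0003";
    int     lo = 0, hi = 4;                 /* search the whole "page" */
    int     lo_prefix = 0, hi_prefix = 0;   /* bytes known equal vs bounds */

    while (lo < hi)
    {
        int     mid = (lo + hi) / 2;
        int     skip = lo_prefix < hi_prefix ? lo_prefix : hi_prefix;
        int     eqlen;

        if (cmp_with_prefix(key, tuples[mid], skip, &eqlen) > 0)
        {
            lo = mid + 1;           /* tuples[mid] is the new lower bound */
            lo_prefix = eqlen;
        }
        else
        {
            hi = mid;               /* tuples[mid] is the new upper bound */
            hi_prefix = eqlen;
        }
    }
    printf("key belongs at slot %d\n", lo);     /* prints 2 */
    return 0;
}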
However, the above only changes the attribute indexing code in
_bt_compare to O(n) for at most about 76% of the index tuples on the
page (1 - (2 / log2(max_tuples_per_page))), while the remaining 20+%
of the compare operations, on average, still have to deal with the
O(n^2) total complexity of index_getattr.
To fix this O(n^2) issue (the issue this thread was originally created
for), the approach I implemented originally is to not use index_getattr
but an "attribute iterator" that incrementally extracts the next
attribute, while keeping track of the current offset into the tuple,
so each next attribute would be O(1). That is implemented in the last
patches of the patchset.
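For reference, a caller of that iterator would look roughly like this
(using the index_attiterinit()/index_attiternext() API from the
attached itup_attiter.h; the function below is made up for illustration
and is not part of the patchset):

#include "postgres.h"
#include "access/itup_attiter.h"

/*
 * Walk the first nkeyatts attributes of an index tuple with the attribute
 * iterator, doing O(1) work per attribute instead of re-decoding all
 * earlier attributes on every access.
 */
static void
walk_key_attributes(IndexTuple itup, TupleDesc tupdesc, int nkeyatts)
{
    IAttrIterStateData iter;

    /* position the iterator so that the next call returns attribute 1 */
    index_attiterinit(itup, 1, tupdesc, &iter);

    for (AttrNumber attnum = 1; attnum <= nkeyatts; attnum++)
    {
        Datum   datum = index_attiternext(itup, attnum, tupdesc, &iter);

        if (iter.isNull)
            continue;           /* NULL attribute: no datum to look at */

        /* ... hand datum to an opclass comparator, hash it, etc. ... */
        (void) datum;
    }
}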
This attribute iterator approach has an issue: It doesn't perform very
well for indexes that make full use of attcacheoff. The bookkeeping
for attribute iteration proved to be much more expensive than just
reading attcacheoff from memory. This is why the latter patches
(patchset 14 0003+) adapt the btree code to generate different paths
for different "shapes" of key index attributes, to allow the current
attcacheoff code to keep its performance, but to get higher
performance for indexes where the attcacheoff optimization can not be
applied. In passing, it also specializes the code for single-attribute
indexes, so that they don't have to manage branching code, increasing
their performance, too.
TLDR:
The specialization in 0003+ is applied because index_getattr is good
when attcacheoff applies, but very bad when it doesn't. Attribute
iteration is worse than index_getattr when attcacheoff applies, but is
significantly better when attcacheoff does not work. By specializing
we get the best of both worlds.
The 0001 and 0002 optimizations were added later to further remove
unneeded calls to the btree attribute compare functions, thus further
reducing the total time spent in _bt_compare.
Anyway.
PFA v14 of the patchset. v13's 0001 is now split in two, containing
prefix truncation in 0001, and 0002 containing the downlink's right
separator/HIKEY optimization.
Performance numbers (data attached):
0001 has significant gains in multi-column indexes with shared
prefixes, where the prefix columns are expensive to compare, but
otherwise doesn't have much impact.
0002 further improves performance across the board, but again mostly
for indexes with expensive compare operations.
0007 sees performance improvements almost across the board, with only
the 'ul' and 'tnt' indexes getting some worse results than master (but
still better average results).
With all patches applied, per-index average performance improvements on
15 runs range from 3% to 290% for INSERT benchmarks, and from -2.83% to
370% for REINDEX.
Configured with autoconf: config.log:
It was created by PostgreSQL configure 17devel, which was
generated by GNU Autoconf 2.69. Invocation command line was
$ ./configure --enable-tap-tests --enable-depend --with-lz4 --with-zstd COPT=-ggdb -O3 --prefix=/home/matthias/projects/postgresql/pg_install --no-create --no-recursion
Benchmark was done on 1m random rows of the pp-complete dataset, as
found on UK Gov's S3 bucket [0]: using a parallel and threaded
downloader is preferred because the throughput is measured in kBps per
client.
I'll do a few runs on the full dataset of 29M rows soon too, but
master's performance is so bad for the 'worstcase' index that I can't
finish its runs fast enough; benchmarking it takes hours per
iteration.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
[0]: http://prod1.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv
Attachments:
image.png (image/png attachment; binary image data not reproduced)