Saving stack space in nbtree's _bt_first function
nbtree's _bt_first function uses arrays on the stack to store the scan
keys that will be used for initial positioning purposes. We could use
dynamic allocation here instead, but experience has shown that that
would cause performance issues -- particularly during nestloop joins
with an inner index scan. The amount of stack space used by _bt_first
does still seem kind of excessive, though.
Perhaps most notably, "BTScanInsertData inskey" takes up 2328 bytes.
There isn't much we can do about that, unless we're willing to make
_bt_first more complicated. The "inskey" variable is so large because
it ends with an array of INDEX_MAX_KEYS-many (generally 32)
ScanKeyData structs -- these structs are themselves fairly large
struct (they're 72 bytes each).
However, there's no reason why "ScanKeyData
notnullkeys[INDEX_MAX_KEYS]" needs to be an array at all. In practice,
_bt_first will only need a single temp notnullkeys ScanKeyData, since
there can never be more than a single deduced NOT NULL constraint used
within our final insertion scan key.
Attached patch shows how this could work. It saves us a little over 2
KiB of stack space, which seems worthwhile given that the patch is so
simple.
I'll submit this patch to the next open commitfest for Postgres 19.
--
Peter Geoghegan
Attachments:
v1-0001-nbtree-Use-only-one-notnullkey-ScanKeyData.patchapplication/octet-stream; name=v1-0001-nbtree-Use-only-one-notnullkey-ScanKeyData.patchDownload
From 3bc3b12bc39738188c3a3f00503965b86eb956db Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 6 Jul 2025 20:03:02 -0400
Subject: [PATCH v1] nbtree: Use only one notnullkey ScanKeyData
---
src/backend/access/nbtree/nbtsearch.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 4af1ff1e9..cc36a2706 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -892,7 +892,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
BTScanInsertData inskey;
ScanKey startKeys[INDEX_MAX_KEYS];
- ScanKeyData notnullkeys[INDEX_MAX_KEYS];
+ ScanKeyData notnullkey;
int keysz = 0;
StrategyNumber strat_total;
BlockNumber blkno = InvalidBlockNumber,
@@ -1122,8 +1122,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
ScanDirectionIsForward(dir) :
ScanDirectionIsBackward(dir)))
{
- /* Yes, so build the key in notnullkeys[keysz] */
- bkey = ¬nullkeys[keysz];
+ /* Final startKeys[] entry will be deduced NOT NULL key */
+ bkey = ¬nullkey;
ScanKeyEntryInitialize(bkey,
(SK_SEARCHNOTNULL | SK_ISNULL |
(impliesNN->sk_flags &
--
2.50.0
Hi,
On 07/07/2025 01:22, Peter Geoghegan wrote:
However, there's no reason why "ScanKeyData
notnullkeys[INDEX_MAX_KEYS]" needs to be an array at all. In practice,
_bt_first will only need a single temp notnullkeys ScanKeyData, since
there can never be more than a single deduced NOT NULL constraint used
within our final insertion scan key.
As an experiment, I added an elog(WARNING,...) just above the main
changed line in the patch, and then ran the tests (make installcheck).
This resulted in lines logged next to some of the SELECT statements in
the following files in src/test/regress/sql:
* create_index.sql
* inherit.sql
I saw the log line only once per query, showing that at least for the
scenarios in the tests, the claim checks out. I then tried a couple of
modifications to those queries which generated the log line, but did not
get it to log more than once.
Kind regards,
Mircea Cadariu
On Tue, Jul 15, 2025 at 4:50 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
As an experiment, I added an elog(WARNING,...) just above the main changed line in the patch, and then ran the tests (make installcheck). This resulted in lines logged next to some of the SELECT statements in the following files in src/test/regress/sql:
create_index.sql
inherit.sqlI saw the log line only once per query, showing that at least for the scenarios in the tests, the claim checks out.
I'm relying on the fact that we'll inevitably break out of the
relevant _bt_first loop over so->keyData[] (the loop that finds our
initial position keys/builds our insertion scan key by storing ScanKey pointers
in startKeys[]) immediately after we generate an IS NOT NULL boundary
key in this way. Such a generated IS NOT NULL key must use the
BTGreaterStrategyNumber strategy (or the BTLessStrategyNumber strategy
when scanning backwards/when the index is NULLS LAST). That's why it's
inherently impossible to need more than a single notnullkey.
It's a little hard to see why this is from the loop itself, since
there is no "break" in the relevant code block (the block that
actually uses notnullkey). Rather, we rely on the generic logic that
builds our startKeys[] entries. It will inevitably "break" before ever
moving on to the next index attribute/next so->keyData[] key because
strat_total will inevitably become
BTGreaterStrategyNumber/BTLessStrategyNumber. In other words, the
generic BTGreaterStrategyNumber/BTLessStrategyNumber test will
inevitably cause the loop to "break" right after our first (and only)
use of notnullkey.
I've tried to make that clearer in the attached v2 revision's commit
message. v2 also slightly simplifies the logic that the notnullkey
code block uses to select the key's strategy (which is also the
strategy that'll be used as _bt_first's strat_total later on), since
that seems like it'll make my explanation slightly clearer to anybody
that reads the code.
Thanks
--
Peter Geoghegan
Attachments:
v2-0001-nbtree-Use-one-notnullkey-_bt_first-ScanKeyData.patchapplication/x-patch; name=v2-0001-nbtree-Use-one-notnullkey-_bt_first-ScanKeyData.patchDownload
From 48fb1a3eb4a0d62f6be5b0c3d33c1ce639511d52 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 6 Jul 2025 20:03:02 -0400
Subject: [PATCH v2] nbtree: Use one notnullkey _bt_first ScanKeyData.
_bt_first need only store one ScanKeyData struct on the stack to build
an IS NOT NULL key that it consed from an implied NOT NULL constraint.
It isn't necessary to store an array of INDEX_MAX_KEYS of such keys.
This saves us a little over 2KB in stack space. It's possible that this
has an appreciable performance benefit. In any case it's a bit simpler.
It isn't possible for more than a single index attribute to need its own
implied IS NOT NULL key: the first such attribute/IS NOT NULL key always
makes _bt_first stop adding additional boundary keys to startKeys[].
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Mircea Cadariu <cadariu.mircea@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm=1kJMSZhhTLoM5BPbwQNWxUj-ynOEh=89ptDZAVgauw@mail.gmail.com
---
src/backend/access/nbtree/nbtsearch.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 4af1ff1e9..e6802101b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -892,7 +892,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
OffsetNumber offnum;
BTScanInsertData inskey;
ScanKey startKeys[INDEX_MAX_KEYS];
- ScanKeyData notnullkeys[INDEX_MAX_KEYS];
+ ScanKeyData notnullkey;
int keysz = 0;
StrategyNumber strat_total;
BlockNumber blkno = InvalidBlockNumber,
@@ -1122,16 +1122,15 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
ScanDirectionIsForward(dir) :
ScanDirectionIsBackward(dir)))
{
- /* Yes, so build the key in notnullkeys[keysz] */
- bkey = ¬nullkeys[keysz];
+ /* Final startKeys[] entry will be deduced NOT NULL key */
+ bkey = ¬nullkey;
ScanKeyEntryInitialize(bkey,
(SK_SEARCHNOTNULL | SK_ISNULL |
(impliesNN->sk_flags &
(SK_BT_DESC | SK_BT_NULLS_FIRST))),
curattr,
- ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
- BTGreaterStrategyNumber :
- BTLessStrategyNumber),
+ ScanDirectionIsForward(dir) ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber,
InvalidOid,
InvalidOid,
InvalidOid,
--
2.50.0
On 16/07/2025 07:27, Peter Geoghegan wrote:
[...] Rather, we rely on the generic logic that
builds our startKeys[] entries. It will inevitably "break" before ever
moving on to the next index attribute/next so->keyData[] key because
strat_total will inevitably become
BTGreaterStrategyNumber/BTLessStrategyNumber. In other words, the
generic BTGreaterStrategyNumber/BTLessStrategyNumber test will
inevitably cause the loop to "break" right after our first (and only)
use of notnullkey.
Thanks for the elaboration and updated patch! Indeed, I see it's set in
the ScanKeyEntryInitialize to either BTGreaterStrategyNumber or
BTLessStrategyNumber, then few lines lower there's the if with the break.
I'm convinced.
I noticed this CI job failure for the V2, seems unrelated to the subject
of the patch though, does it need a retry?
https://cirrus-ci.com/task/5781246762024960
Kind regards,
Mircea Cadariu
On Wed, Jul 16, 2025 at 9:17 AM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
Thanks for the elaboration and updated patch! Indeed, I see it's set in
the ScanKeyEntryInitialize to either BTGreaterStrategyNumber or
BTLessStrategyNumber, then few lines lower there's the if with the break.I'm convinced.
Pushed.
I noticed this CI job failure for the V2, seems unrelated to the subject
of the patch though, does it need a retry?
https://cirrus-ci.com/task/5781246762024960
It must be unrelated. I didn't see any such failure myself just now,
and I can't imagine how it could possibly be relevant.
Thanks for the review
--
Peter Geoghegan