Returning nbtree posting list TIDs in DESC order during backwards scans

Started by Peter Geoghegan · 6 months ago · 17 messages
#1 Peter Geoghegan <pg@bowt.ie>
1 attachment(s)

Currently, nbtree always returns individual posting list TIDs to the
scan in ASC order -- even during a backwards scan (more on why the
deduplication patch deliberately did things that way later). But
allowing backwards scans to return TIDs in "somewhat descending order"
like this results in needless unpinning and re-pinning of the same
heap page. This seems worth fixing.

Patch goals
===========

nbtree scans ought to guarantee that they'll return all TIDs in
whatever order they appear on the page, when read either
front-to-back, or back-to-front (whichever order matches the scan's
current direction). Individual posting list TIDs should also be
returned in either front-to-back or back-to-front order, also
according to the scan's direction. In general, the order that TIDs are
returned in ought to never be affected by the use of deduplication --
including during backwards scans.

Attached patch teaches nbtree's _bt_readpage function to return TIDs
from a posting list in DESC order during backwards scans, bringing
nbtree in line with the ideal described in the previous paragraph.
This approach seems more principled to me, and is likely to have a
small performance benefit.

I'll submit this to the first commitfest for Postgres 19.

Test case
=========

Here's an example of the problem on HEAD, which the patch fixes:

-- Setup table with low cardinality index:
create table duplicate_test(dup int4);
create index on duplicate_test (dup);
insert into duplicate_test
select val from generate_series(1, 18) val,
generate_series(1, 1000) dups_per_val;

-- Forwards scan, gets 23 buffer hits:
explain analyze
select ctid, * from
duplicate_test
where dup < 5
order by dup;

-- Equivalent backwards scan, gets 42 buffer hits on HEAD:
explain analyze
select ctid, * from
duplicate_test
where dup < 5
order by dup desc;

With the patch applied, *both* queries get the expected 23 buffer hits.

I am generally suspicious of cases where backwards scans are at a
gratuitous disadvantage relative to equivalent forwards scans. See
previous work in commit c9c0589f and commit 1bd4bc85 for earlier
examples of fixing problems that have that same quality to them. This
might be the last such patch that I'll ever have to write.

Sorting so->killedItems[] in _bt_killitems
==========================================

Making this change in _bt_readpage creates a problem in _bt_killitems:
it currently expects posting list TIDs in ascending order (the loop
invariant relies on that), which is why the deduplication patch didn't
just do things like this in _bt_readpage to begin with. The patch
deals with that new problem by qsort'ing the so->killedItems[] array
(at the top of _bt_killitems) such that the items always appear
exactly once, in the expected ASC order.
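
To make that concrete, here's a minimal sketch of the idea (not the patch
itself; killed_items_cmp is a name I'm making up here). It assumes that
killedItems[] stores int indexes into so->currPos.items[], which the patch
keeps in leaf-page order:

static int
killed_items_cmp(const void *a, const void *b)
{
    int     ia = *(const int *) a;
    int     ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

/* at the top of _bt_killitems: restore the expected ASC order */
int     numKilled = so->numKilled;
int     nunique = 0;

qsort(so->killedItems, numKilled, sizeof(int), killed_items_cmp);

/* squash duplicates in place, so each item appears exactly once */
for (int i = 0; i < numKilled; i++)
{
    if (nunique == 0 || so->killedItems[i] != so->killedItems[nunique - 1])
        so->killedItems[nunique++] = so->killedItems[i];
}
numKilled = nunique;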

Another benefit of the qsort approach in _bt_killitems is that it'll
gracefully handle scrollable cursor scans that happen to scroll back
and forth on the same leaf page. This might result in the same dead
TID being added to the so->killedItems[] array multiple times, in
neither ASC nor DESC order. That also confuses the loop inside
_bt_killitems, in a way that needlessly prevents setting LP_DEAD bits
(this is a preexisting problem, not a problem created by the changes
that the patch makes to _bt_readpage). Having duplicate TIDs in
so->killedItems[] like this isn't particularly likely, but it seems
worth having proper handling all the same, on general principle.

I tend to doubt that the overhead of the qsort will ever really
matter, but if it does then we can always limit it to backwards scans
(meaning limit it to cases where so->currPos.dir indicates that
_bt_readpage processed the page in the backwards scan direction), and
drop the goal of dealing with duplicate so->killedItems[] TIDs
gracefully.

Note that the qsort is completely useless during standard forwards
scans. It's hard to tell those apart from scrollable cursor scans that
briefly changed directions "within a page", though, so I'm inclined to
just always do the qsort, rather than trying to cleverly avoid it.
That's what happens in this v1 of the patch (we qsort unconditionally
in _bt_killitems).

--
Peter Geoghegan

Attachments:

v1-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch (application/octet-stream)
#2 Peter Geoghegan <pg@bowt.ie>
In reply to: Peter Geoghegan (#1)
1 attachment(s)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

On Mon, Jun 16, 2025 at 12:46 PM Peter Geoghegan <pg@bowt.ie> wrote:

Making this change in _bt_readpage creates a problem in _bt_killitems:
it currently expects posting list TIDs in ascending order (the loop
invariant relies on that), which is why the deduplication patch didn't
just do things like this in _bt_readpage to begin with. The patch
deals with that new problem by qsort'ing the so->killedItems[] array
(at the top of _bt_killitems) such that the items always appear
exactly once, in the expected ASC order.

Actually, we can just completely get rid of so->killedItems[]. We can
replace it with a new 1-bit itemDead field in the BTScanPosItem struct
(the struct used for so->currPos.items[] entries). That way, the patch
avoids a qsort. The patch doesn't need to change the order of anything
(except of course for the order that _bt_readpage initially saves
posting list tuple TIDs in, when scanning backwards).

The problem with so->killedItems[] is that it stores things in
whatever order successive btgettuple calls find most convenient in the
kill_prior_tuple path. It turns out that it's even more convenient for
btgettuple if this kill_prior_tuple path simply sets the relevant
(prior/just-returned) item's so->currPos.items[].itemDead field. There
is no need to allocate memory for so->killedItems[], and no need to
worry about running out of space in so->killedItems[] in cases
involving scrollable cursors that happen to scroll back and forth,
triggering kill_prior_tuple for the same TID. We'll simply set the
TID/item's so->currPos.items[].itemDead field a second or a third
time, which can't complicate things, since my new approach makes that
idempotent.
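
In sketch form, the btgettuple kill_prior_tuple path becomes roughly this
(illustrative only, assuming the new 1-bit field described below):

if (scan->kill_prior_tuple)
{
    /* mark the just-returned item; setting the bit twice is harmless */
    so->currPos.items[so->currPos.itemIndex].itemDead = 1;
}

No killedItems[] allocation, no bounds check, no ordering concerns.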

Attached is v2 of the patch, which works that way. This seems like a
better direction to take things in within _bt_killitems.

In general we're quite sensitive to the size of the BTScanPosItem
struct, since (at least for now) we routinely allocate
MaxTIDsPerBTreePage-many of them (i.e. 1,358 of them). 1,358 * 10
bytes is already too much. But there's no need to make the relevant
allocation even larger. I can steal a bit from
BTScanPosItem.tupleOffset for these new
so->currPos.items[].itemDead/BTScanPosItem.itemDead fields. That's
safe, since we only really need 15 bits for each
BTScanPosItem.tupleOffset value.
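
In other words, the struct can be adjusted along these lines (a sketch;
the exact field layout in the patch may differ):

typedef struct BTScanPosItem    /* what we remember about each match */
{
    ItemPointerData heapTid;    /* TID returned to the scan */
    OffsetNumber indexOffset;   /* index item's location within page */
    uint16      tupleOffset:15, /* IndexTuple's offset in workspace, if any */
                itemDead:1;     /* stolen bit: known dead, per btgettuple */
} BTScanPosItem;

That stays at 10 bytes, so the up-to-1,358-entry array doesn't grow.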

--
Peter Geoghegan

Attachments:

v2-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch (application/octet-stream)
#3 Mircea Cadariu <cadariu.mircea@gmail.com>
In reply to: Peter Geoghegan (#2)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

Hi,

-    for (int i = 0; i < numKilled; i++)
+    for (int i = so->currPos.firstItem; i <= so->currPos.lastItem; i++)

Does the above change mean it will have to do more work in the loop?
Whereas before it visited strictly the killed items, it now has to go
through all of them?

Kind regards,

Mircea Cadariu

#4 Peter Geoghegan <pg@bowt.ie>
In reply to: Mircea Cadariu (#3)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

On Wed, Jul 16, 2025 at 5:27 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:

Does the above change mean it will have to do more work in the loop?
Whereas before it visited strictly the killed items, it now has to go
through all of them?

Yes, that's true. Any item that the scan returns from the so->currPos
page needs to be considered within the loop.

The loop has an early check for this (for non-itemDead entries) here:

/* Quickly skip over items never ItemDead-set by btgettuple */
if (!kitem->itemDead)
continue;

I really doubt that this matters, because:

* This can only happen when we actually call _bt_killitems in the
first place, so there has to be at least one item whose index tuple
really does need to be LP_DEAD-set.

* The chances of there being a huge number of so->currPos.items[]
items but only one or two with their "itemDead" bit set seem low, in
general.

* The new loop is significantly simpler in that it iterates through
so->currPos.items[] in order, without any of the so->killedItems[]
indirection you see on HEAD. Modern CPUs are likely to skip over
non-itemDead entries very quickly.
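
For reference, the reworked loop shape looks something like this
(a simplified sketch; the real loop also has to match each TID against
the tuples now on the page):

for (int i = so->currPos.firstItem; i <= so->currPos.lastItem; i++)
{
    BTScanPosItem *kitem = &so->currPos.items[i];

    /* Quickly skip over items never ItemDead-set by btgettuple */
    if (!kitem->itemDead)
        continue;

    /* ... look up kitem->heapTid on the page, maybe set LP_DEAD ... */
}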

Note that so->killedItems[] (which this patch removes) can be in
ascending leaf-page-wise order, descending leaf-page-wise order, or
(with a scrollable cursor) some random mix of the two -- even when
there are no posting list tuples involved.

--
Peter Geoghegan

#5 Peter Geoghegan <pg@bowt.ie>
In reply to: Peter Geoghegan (#4)
1 attachment(s)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

On Thu, Jul 17, 2025 at 2:26 PM Peter Geoghegan <pg@bowt.ie> wrote:

The loop has an early check for this (for non-itemDead entries) here:

/* Quickly skip over items never ItemDead-set by btgettuple */
if (!kitem->itemDead)
continue;

I really doubt that this matters

I think that this patch isn't too far off being committable. I'll need
to think about issues around adding the new kitem->itemDead bitfield.
I'm not really worried about _bt_killitems; more so about the routines
called by _bt_readpage, which must make sure that the bit is unset
every time a TID is saved in so->currPos.items[].

Attached is v3. Not much has changed. We now defensively unset each
previously set kitem->itemDead within _bt_killitems. I've also added a
pair of macros that represent 0 and 1 values for these kitem->itemDead
bits.

Thanks
--
Peter Geoghegan

Attachments:

v3-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch (application/octet-stream)
#6 Mircea Cadariu <cadariu.mircea@gmail.com>
In reply to: Peter Geoghegan (#5)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

Hi,

On 18/07/2025 05:37, Peter Geoghegan wrote:

I think that this patch isn't too far off being committable.

Agreed. I just tried v3, and got 23 buffer hits, same as in the original
demonstrative example.

+                    /* Remember all prior TIDs (must be at least one) */
+                    for (int i = nitems - 2; i >= 0; i--)

This loop has to start from the end, in order to return TIDs in DESC order?

I'm not really worried about _bt_killitems; more so about the routines
called by _bt_readpage, which must make sure that the bit is unset
every time a TID is saved in so->currPos.items[].

I did a search through the code for "so->" and had a look at the results
for functions I'd expect to see changes in, at a minimum:

* btbeginscan
* btrescan
* btendscan
* btrestrpos
* _bt_steppage
* _bt_readfirstpage

I could find all of the above being touched in v3.

Modern CPUs are likely to skip over
non-itemDead entries very quickly.

Okay, yeah. A sequential iteration through an array will be fast, and we
expect the branch predictor to do its job properly with the "if
(!kitem->itemDead)".

I'll need
to think about issues around adding the new kitem->itemDead bitfield.

It's probably not a concern here, but I was reminded of this ABI break:
/messages/by-id/CABOikdNmVBC1LL6pY26dyxAS2f+gLZvTsNt=2XbcyG7WxXVBBQ@mail.gmail.com

Kind regards,

Mircea Cadariu

#7 Andy Fan <zhihuifan1213@163.com>
In reply to: Peter Geoghegan (#5)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

Hi,

On Thu, Jul 17, 2025 at 2:26 PM Peter Geoghegan <pg@bowt.ie> wrote:

The loop has an early check for this (for non-itemDead entries) here:

/* Quickly skip over items never ItemDead-set by btgettuple */
if (!kitem->itemDead)
continue;

I really doubt that this matters

I'll need to think about issues around adding the new kitem->itemDead
bitfield. I'm not really worried about _bt_killitems; more so about
the routines called by _bt_readpage, which must make sure that the bit
is unset every time a TID is saved in so->currPos.items[].

Yes, otherwise a live tuple may be marked as dead, which would be
terrible. I think you have unset the bit in all the needed places,
including _bt_saveitem, _bt_savepostingitem, and _bt_setuppostingitems
(for the first item of a posting list tuple). I can't find anything
that was missed.

(I thought another place for this work might be _bt_returnitem, which
might be a more centralized place to unset the bit. Later I realized
that it is totally wrong, because for queries like SELECT * FROM t
WHERE idx_col = 3 LIMIT 3; some items in so->currPos.items[] are
filled in by _bt_readpage but maybe never returned to the heap at all,
and yet _bt_killitems still runs against the page as a whole, so
unsetting the bit in _bt_returnitem would be too late...)

I think this patch does two things: one is refactoring the data
structure for _bt_killitems, the other is scanning the posting list
backwards during a backwards scan. Does the change below belong to
either of them? Is it an intentional change?

_bt_killitems:

 		if (offnum < minoff)
-			continue;			/* pure paranoia */
+		{
+			/*
+			 * Page must have originally been the rightmost page, but has
+			 * since split
+			 */
+			Assert(!so->dropPin);
+			offnum = minoff;
+		}

At last, I got the same test result as you. The buffer hits go back to
23 in the test case; thanks for working on this!

I think that this patch isn't too far off being committable.

I think so.

--
Best Regards
Andy Fan

#8 Peter Geoghegan <pg@bowt.ie>
In reply to: Andy Fan (#7)
1 attachment(s)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

On Sun, Jul 27, 2025 at 9:43 AM Andy Fan <zhihuifan1213@163.com> wrote:

At last, I got the same test result as you. The buffer hits go back to
23 in the test case; thanks for working on this!

I think that this patch isn't too far off being committable.

I think so.

Coming back to this patch now, after several months of work on index
prefetching.

I decided that it wasn't such a great idea to reuse/steal an unused
"itemDead" bit from the BTScanPosItem.tupleOffset field after all. That
forces _bt_killitems to iterate through every so->currPos.items[]
entry, not just those that are known to require LP_DEAD marking.

Tomas Vondra suggested that I keep killedItems as a separate
allocation (as it is on master), while using a Bitmapset to represent
killedItems (unlike on master, where it is represented using a simple
array). This has all of the same advantages as my previous approach,
but doesn't have the aforementioned disadvantages within _bt_killitems
(plus we no longer need to change BTScanPosItem in any way).
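
In outline, the scheme works like this (a sketch using the standard
Bitmapset calls from nodes/bitmapset.h, not the patch itself):

/* btgettuple, kill_prior_tuple path: remember the item's index */
so->killedItems = bms_add_member(so->killedItems, so->currPos.itemIndex);

/* _bt_killitems: walk killed items in ascending itemIndex order */
int     i = -1;

while ((i = bms_next_member(so->killedItems, i)) >= 0)
{
    BTScanPosItem *kitem = &so->currPos.items[i];

    /* ... look up kitem->heapTid on the page, maybe set LP_DEAD ... */
}

/* when leaving the page: back to the empty set */
bms_free(so->killedItems);
so->killedItems = NULL;

Adding the same index twice is idempotent, and bms_next_member naturally
visits indexes in ascending order, which preserves the loop invariant.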

Attached is v4, which does it that way.

My plan is to commit this improved version in the next couple of days.

--
Peter Geoghegan

Attachments:

v4-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch (application/octet-stream)
#9 Chao Li <li.evan.chao@gmail.com>
In reply to: Peter Geoghegan (#8)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

Hi Peter,

The patch v4 looks carefully written and technically solid, and the core logic (switching killedItems[] to a Bitmapset and reworking backwards posting list scans) is coherent.

I just got a few comments/questions:

On Dec 3, 2025, at 11:08, Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v4, which does it that way.

My plan is to commit this improved version in the next couple of days.

--
Peter Geoghegan
<v4-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch>

1
```
- /* Always invalidate so->killedItems[] before leaving so->currPos */
- so->numKilled = 0;
```

The old code only sets so->numKilled to 0 and reuses the memory of so->killedItems[]; now the new code always calls bms_free(so->killedItems) and re-allocates memory the next time a member is added to the Bitmapset.

I am thinking that, if there were a bms_clear(), then we could have just cleared the bitmap and reused the memory next time. But unfortunately, there is no such bms_clear() today. What do you think about adding bms_clear() and using it here? If you don't want to do that here, I can try it once you push this patch.

2
```
+					/* Set up posting list state (and remember last TID) */
 					itemIndex--;
 					tupleOffset =
 						_bt_setuppostingitems(so, itemIndex, offnum,
-											  BTreeTupleGetPostingN(itup, 0),
+											  BTreeTupleGetPostingN(itup, nitems - 1),
 											  itup);
-					/* Remember additional TIDs */
-					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+
+					/* Remember all prior TIDs (must be at least one) */
+					for (int i = nitems - 2; i >= 0; i--)
 					{
 						itemIndex--;
 						_bt_savepostingitem(so, itemIndex, offnum,
```

I wonder if the comment "must be at least one" should apply to the assignment of tupleOffset? The "for" loop starts from nitems - 2; will there still be at least one item?

3
```
 				/*
-				 * Don't bother advancing the outermost loop's int iterator to
-				 * avoid processing killed items that relate to the same
-				 * offnum/posting list tuple.  This micro-optimization hardly
-				 * seems worth it.  (Further iterations of the outermost loop
-				 * will fail to match on this same posting list's first heap
-				 * TID instead, so we'll advance to the next offnum/index
-				 * tuple pretty quickly.)
+				 * Don't advance itemIndex for outermost loop, no matter how
+				 * nextIndex was advanced.  It's possible that items whose
+				 * TIDs weren't matched in posting list can still be killed
+				 * (there might be a later tuple whose TID is a match).
 				 */
 				if (j == nposting)
 					killtuple = true;
```

I really don't get what "Don't bother advancing the outermost loop's int iterator" means. Here we only set killtuple to true; then, if (killtuple && !ItemIdIsDead(iid)), the inner while loop breaks, in which case the outer while loop "while ((itemIndex = bms_next_member(so->killedItems, itemIndex)) >= 0)" will advance itemIndex.

I know this is a legacy comment; if you can explain a little bit, that will be very appreciated.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#10 Victor Yegorov <vyegorov@gmail.com>
In reply to: Peter Geoghegan (#8)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

On Wed, Dec 3, 2025 at 06:09, Peter Geoghegan <pg@bowt.ie> wrote:

Coming back to this patch now, after several months of work on index
prefetching.

I decided that it wasn't such a great idea to reuse/steal an unused
"itemDead" bit from the BTScanPosItem.tupleOffset field after all. That
forces _bt_killitems to iterate through every so->currPos.items[]
entry, not just those that are known to require LP_DEAD marking.

Tomas Vondra suggested that I keep killedItems as a separate
allocation (as it is on master), while using a Bitmapset to represent
killedItems (unlike on master, where it is represented using a simple
array). This has all of the same advantages as my previous approach,
but doesn't have the aforementioned disadvantages within _bt_killitems
(plus we no longer need to change BTScanPosItem in any way).

Attached is v4, which does it that way.

My plan is to commit this improved version in the next couple of days.

--
Peter Geoghegan

Patch looks fine, applies and compiles cleanly, passes tests.

I'd like to point out a missing space after the dot in the 2nd para of
the commit message; it falls out of style.

--
Victor Yegorov

#11 Mircea Cadariu <cadariu.mircea@gmail.com>
In reply to: Peter Geoghegan (#8)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

Hi,

On 03/12/2025 03:08, Peter Geoghegan wrote:

Coming back to this patch now, after several months of work on index
prefetching.

Nice, welcome back! It would be great to wrap this up.

Tomas Vondra suggested that I keep killedItems as a separate
allocation (as it is on master), while using a Bitmapset to represent
killedItems (unlike on master, where it is represented using a simple
array). This has all of the same advantages as my previous approach,
but doesn't have the aforementioned disadvantages within _bt_killitems
(plus we no longer need to change BTScanPosItem in any way).

Good one!

-       * Don't bother advancing the outermost loop's int iterator to
-       * avoid processing killed items that relate to the same
-       * offnum/posting list tuple.  This micro-optimization hardly
-       * seems worth it.  (Further iterations of the outermost loop
-       * will fail to match on this same posting list's first heap
-       * TID instead, so we'll advance to the next offnum/index
-       * tuple pretty quickly.)
+       * Don't advance itemIndex for outermost loop, no matter how
+       * nextIndex was advanced.  It's possible that items whose
+       * TIDs weren't matched in posting list can still be killed
+       * (there might be a later tuple whose TID is a match).

Hmm, as far as I can tell, the old version of the comment seems more
accurate. If I understand correctly, it's still safe to do the
micro-optimization, but we choose not to do it because we expect the
speedup will not be worth the increased complexity / risk of introducing
bugs.

--
Thanks,
Mircea Cadariu

#12 Peter Geoghegan <pg@bowt.ie>
In reply to: Chao Li (#9)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

On Wed, Dec 3, 2025 at 7:32 AM Chao Li <li.evan.chao@gmail.com> wrote:

The old code only sets so->numKilled to 0 and reuses the memory of so->killedItems[]; now the new code always calls bms_free(so->killedItems) and re-allocates memory the next time a member is added to the Bitmapset.

I am thinking that, if there were a bms_clear(), then we could have just cleared the bitmap and reused the memory next time. But unfortunately, there is no such bms_clear() today. What do you think about adding bms_clear() and using it here? If you don't want to do that here, I can try it once you push this patch.

I don't think that it's worth the complexity. We can rely on palloc()
to recycle the memory that was freed by the most recent bms_free().

I say this in part because the complexity of reusing the same
Bitmapset would be rather high. The idea that the only valid
representation of an empty set is a NULL pointer is baked into the
Bitmapset API.

Note also that we'll use much less memory for killedItems by
representing it as a Bitmapset. We'll use at most one bit per
so->currPos.items[] item, whereas before we used 4 bytes per item.
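
(Concretely, as back-of-the-envelope arithmetic: with MaxTIDsPerBTreePage
at 1,358, a worst-case Bitmapset is about 1,358 / 8 ≈ 170 bytes, versus
1,358 * 4 = 5,432 bytes for the int array.)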

I wonder if the comment "must be at least one" should apply to the assignment of tupleOffset? The "for" loop starts from nitems - 2; will there still be at least one item?

By definition, a posting list tuple has at least 2 TIDs -- that's a
posting list invariant. If there was only 1 TID, then it wouldn't be a
posting list in the first place. (Unlike within GIN, where single TID
posting lists are possible.)
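
Expressed as illustrative assertions (using the existing nbtree.h
macros), the invariant that makes the "nitems - 2" loop bound safe is:

Assert(BTreeTupleIsPosting(itup));
Assert(BTreeTupleGetNPosting(itup) >= 2);   /* so nitems - 2 >= 0 */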

/*
- * Don't bother advancing the outermost loop's int iterator to
- * avoid processing killed items that relate to the same
- * offnum/posting list tuple.  This micro-optimization hardly
- * seems worth it.  (Further iterations of the outermost loop
- * will fail to match on this same posting list's first heap
- * TID instead, so we'll advance to the next offnum/index
- * tuple pretty quickly.)
+ * Don't advance itemIndex for outermost loop, no matter how
+ * nextIndex was advanced.  It's possible that items whose
+ * TIDs weren't matched in posting list can still be killed
+ * (there might be a later tuple whose TID is a match).
 */
if (j == nposting)
    killtuple = true;

I really don't get what "Don't bother advancing the outermost loop's int iterator" means. Here we only set killtuple to true; then, if (killtuple && !ItemIdIsDead(iid)), the inner while loop breaks, in which case the outer while loop "while ((itemIndex = bms_next_member(so->killedItems, itemIndex)) >= 0)" will advance itemIndex.

The old comment should probably have been written more like the new
one (that I propose to replace it with). It's saying "don't try to be
clever by remembering that we already determined that all the TIDs
that we tried to compare to this posting list aren't matches for *any*
TID". But I don't think that that's accurate; we *haven't* determined
that those TIDs aren't matches for *any and all* TIDs on the page
(only for those now in the posting list specifically). We might still
be able to match those TIDs to later tuples, immediately after the
posting list.

Note that this is all a bit academic and theoretical; in practice it
rarely comes up. During so->dropPin scans (which is the common case),
we'll give up/get scared at the start of _bt_killitems if the page has
changed at all since it was initially examined within _bt_readpage
anyway -- no matter how the page was modified. It doesn't matter that
the page probably *won't* have been modified by VACUUM when
_bt_kilitems gets scared of modifying the page like this (VACUUM is
the only thing that truly makes it unsafe for _bt_killitems to run,
but _bt_killitems is conservative/easily scared).

So while it is true that "We might still be able to match those TIDs
to later tuples, immediately after the posting list" (as I said in the
paragraph before the previous paragraph), we can only do so during
!so->dropPin scans. In other words, only during index-only scans, or
scans of an unlogged relation, or when we don't use an MVCC snapshot.
All of which are rare. Which makes all this fairly
academic/theoretical (mostly it's historical, 10+ years ago *all*
scans were !so->dropPin scans).

--
Peter Geoghegan

#13 Chao Li <li.evan.chao@gmail.com>
In reply to: Peter Geoghegan (#12)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

On Dec 4, 2025, at 01:31, Peter Geoghegan <pg@bowt.ie> wrote:

Note also that we'll use much less memory for killedItems by
representing it as a Bitmapset. We'll use at most one bit per
so->currPos.items[] item, whereas before we used 4 bytes per item.

That's true, a Bitmapset saves 7/8 of the memory used for killedItems. However, it also brings a small performance burden, because bms_next_member() does an O(N) iteration. Say indexes {0, 1000} of so->currPos.items[] are killed: the old code directly gives 0 and 1000 to the "for" loop, but the new code needs to iterate past 999 unset bits to reach the next member, 1000. Maybe that's affordable.

Actually, MaxTIDsPerBTreePage (the max length of so->currPos.items[]) is a value around 1,000; in the old code, killedItems could have been "short *" instead of "int *", which would also save half of the memory usage.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#14 Peter Geoghegan <pg@bowt.ie>
In reply to: Victor Yegorov (#10)
1 attachment(s)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

On Wed, Dec 3, 2025 at 10:18 AM Victor Yegorov <vyegorov@gmail.com> wrote:

Patch looks fine, applies and compiles cleanly, passes tests.

This patch was trickier than initially expected. I paid no attention
to the possible downside of changing the posting list iteration code
in _bt_readpage (i.e., scanning posting lists in descending order for
backwards scans), which was an oversight.

It seems that we're very sensitive to how the compiler allocates
registers within _bt_readpage, which led to v4 of the patch (and
presumably all earlier versions) not storing itemIndex in a register.
With v4, the compiler spilled the itemIndex comparison to the stack
(at least on my machine, which uses GCC 15.2 from Debian unstable),
and so had to read itemIndex from memory on each loop iteration. This
regressed pgbench's SELECT workload by about 1% of total TPS (on my
Zen 3 CPU).

Attached v5 avoids the regression by tweaking _bt_readpage. I will
commit this version soon (I really mean it this time!).

I'm not sure why these changes have the intended effect -- I mostly
figured all this out through trial and error. Though I can say that my
testing showed that _not_ changing the posting list iteration code in
the _bt_readpage forwards scan loop makes the crucial difference. The
other tweaks probably aren't strictly necessary, but they seem to make
the compiler generate ever so slightly faster code (at least with
pgbench SELECT), so I kept them in.

I also gave up on the idea of using a bitmapset for v5 -- the issue
with regressing _bt_readpage discouraged me from adding code that
allocates memory (more than once per scan) within btgettuple, which is
a performance-critical path. We can simply sort and unique-ify the
killedItems array from within _bt_killitems instead -- which is
largely the same way that my original v1 did it. That way it won't
matter what order we append to killedItems in, relative to the scan
direction. (The downside of sticking with an array for killedItems is
that we can still run out of array space in extreme cases involving
scrollable cursors that return many killable tuples, and repeatedly
append the same TID to killedItems[], but that doesn't seem like much
of a loss to me -- that almost never happens in practice.)
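
The sort-and-uniqueify step itself can be written very compactly; a
sketch (reusing a simple int comparator like the killed_items_cmp I
sketched upthread, plus the existing lib/qunique.h helper -- the
committed code may differ):

qsort(so->killedItems, so->numKilled, sizeof(int), killed_items_cmp);
so->numKilled = (int) qunique(so->killedItems, so->numKilled,
                              sizeof(int), killed_items_cmp);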

I'd like to point out a missing space after the dot in the 2nd para of
the commit message; it falls out of style.

Fixed that too.

Thanks

--
Peter Geoghegan

Attachments:

v5-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch (application/x-patch)
#15 Victor Yegorov <vyegorov@gmail.com>
In reply to: Peter Geoghegan (#14)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

On Wed, Dec 10, 2025 at 21:40, Peter Geoghegan <pg@bowt.ie> wrote:

Attached v5 avoids the regression by tweaking _bt_readpage. I will
commit this version soon (I really mean it this time!).

I also gave up on the idea of using a bitmapset for v5 -- the issue
with regressing _bt_readpage discouraged me from adding code that
allocates memory (more than once per scan) within btgettuple, which is
a performance-critical path. We can simply sort and unique-ify the
killedItems array from within _bt_killitems instead -- which is
largely the same way that my original v1 did it. That way it won't
matter what order we append to killItems in, relative to the scan
direction. (The downside of sticking with an array for killedItems is
that we can still run out of array space in extreme cases involving
scrollable cursors that return many killable tuples, and repeatedly
append the same TID to killedItems[], but that doesn't seem like much
of a loss to me -- that almost never happens in practice.)

Compiled and tested without issues.

Small note: as you're removing the "We rely on the convention that heap
TIDs in the scanpos items array are stored in ascending heap TID order…"
part of the comment, perhaps it'd be good to add something similar to
the "Sort and uniqueify so->killedItems[] to deal with all this." piece?
Something like:

+ * Sort and uniqueify so->killedItems[] to deal with all this,
+ * as TIDs are processed in ascending order.

I feel the need for such a comment from my POV as a code reader.

--
Victor Yegorov

#16 Peter Geoghegan <pg@bowt.ie>
In reply to: Victor Yegorov (#15)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

On Wed, Dec 10, 2025 at 2:41 PM Victor Yegorov <vyegorov@gmail.com> wrote:

Compiled and tested without issues.

Pushed. Thanks for the review!

Small note: as you're removing the "We rely on the convention that heap
TIDs in the scanpos items array are stored in ascending heap TID order…"
part of the comment, perhaps it'd be good to add something similar to
the "Sort and uniqueify so->killedItems[] to deal with all this." piece?
Something like:

+ * Sort and uniqueify so->killedItems[] to deal with all this,
+ * as TIDs are processed in ascending order.

I feel the need for such a comment from my POV as a code reader.

I did have a comment like that at one point, but I felt that it didn't
quite make sense to keep it. Such a comment would address how things
used to work, not how they work now (also how they really should have
worked all along).

With this change in place, we have a strict guarantee that the
contents of so->currPos.items[] will always be exactly the same when
scanning a page backwards as when scanning the same page forwards --
regardless of whether or not posting lists are involved, or any other
factor. In short, so->currPos.items[] will contain matching items in
index key space order in all cases. So ISTM that it doesn't make any
sense to draw attention to posting list TIDs within _bt_killitems. All
of that is strongly implied by the index key space order thing.

--
Peter Geoghegan

#17 Peter Geoghegan <pg@bowt.ie>
In reply to: Peter Geoghegan (#16)
Re: Returning nbtree posting list TIDs in DESC order during backwards scans

On Wed, Dec 10, 2025 at 5:33 PM Peter Geoghegan <pg@bowt.ie> wrote:

I did have a comment like that at one point, but I felt that it didn't
quite make sense to keep it. Such a comment would address how things
used to work, not how they work now (also how they really should have
worked all along).

On second thought, you (Victor) had this right: we really should have
such a comment.

I must have forgotten that the loop in _bt_killitems doesn't iterate
through so->currPos.items[] directly; it iterates through
killedItems[]. Earlier versions of the patch (that fully got rid of
killedItems) *directly* looped over so->currPos.items[], but the
committed version doesn't work that way.

I pushed a commit just now that adds a comment to clarify the
situation. It specifically mentions posting list tuples, per your
suggestion. (The commit also adds a documenting assertion to verify
leaf page order within the _bt_killitems loop.)

Thanks
--
Peter Geoghegan