[GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

Started by Salma El-Sayedabout 2 months ago14 messageshackers

salmasayed182003@gmail.com

about 2 months ago

Hello Hackers,

My name is Salma El-Sayed, and I am working on B-tree Index Bloat Reduction
as part of GSoC 2026. This is my introduction email[1]/messages/by-id/CANBEAPFW64QUGmMYoPekV6jGrn5Yx8kTNfeEtcTG9CCBUUBoLQ@mail.gmail.com. I would like to
share my current approach for page merging and get your feedback.

*The Core Idea*
Two sibling B-tree pages L (left) and R (right) are candidates for merging
when both pages are below a size threshold (less than 10% full for example)
and their combined size fits within the fill factor.

The merge copies all data from L into R, then:
1- R is marked BTP_MERGED --> it now contains L's data.
2- L is marked BTP_MERGED_AWAY --> it is logically gone but kept around
temporarily for any active scan that still holds a reference to it.
3- L is unlinked from its parent. VACUUM will handle updating the
right-link of L's left sibling as part of normal cleanup.

Once all transactions that were active at the time of the merge have
committed, L can transition to HALF_DEAD and be reclaimed by VACUUM.

*Handling Concurrent Scans*
The tricky part is ensuring correctness for scans that were already in
progress when the merge happened.

-Forward Scans-
A forward scanner that arrives at L after the merge sees BTP_MERGED_AWAY
and follows through to R.

If the scanner had already read L before the merge, it would have saved L's
high key. It can then binary-search R to skip the tuples it already saw and
continue from where it left off.

-Backward Scans-
A backward scanner that arrives at R after the merge sees BTP_MERGED, reads
R (which now contains L's data), and skips L entirely.

If the scanner had already read R before the merge, when it later arrives
at L it sees BTP_MERGED_AWAY. Here we need a small piece of state on the
scanner:
- If the scanner has not previously seen a BTP_MERGED flag, it knows it
visited R before the merge and has not yet seen L's data, so it must read L.
- If the scanner has previously seen BTP_MERGED, it knows it read R after
the merge, meaning R already contained L's data, so it should skip L and
continue left.

*Questions for the Community*

1- Triggering the merge: how aggressive should we be?

How should the merge process be triggered? I see two broad options:
a- Full-index scan: aggressively find all merge candidates in one pass.
b- Bounded scan: a separate dedicated scan that stops after a controlled
number of pages, giving the user control over how much of the index to
process at a time.

A possible interface for the candidate scanner might look something like:
bt_merge_pages(index regclass, min_fillfactor float8, max_fillfactor
float8, max_pages int4)

** This would allow the process to run over a controlled subset of the
index rather than always scanning the whole thing, which seems important
for large tables where a full scan could run for a very long time. What is
the community's preferred approach here?

2- Flag naming and free bits

I am proposing two new flags in btpo_flags:
- BTP_MERGED: set on R to indicate it absorbed L's data.
- BTP_MERGED_AWAY: set on L to indicate it has been merged away.

** Are there free bits available in btpo_flags suitable for this purpose?
And does the community see any objection to consuming two bits in this
field for the merge mechanism?

Thank you for your time. I look forward to your feedback!

[1]: /messages/by-id/CANBEAPFW64QUGmMYoPekV6jGrn5Yx8kTNfeEtcTG9CCBUUBoLQ@mail.gmail.com
/messages/by-id/CANBEAPFW64QUGmMYoPekV6jGrn5Yx8kTNfeEtcTG9CCBUUBoLQ@mail.gmail.com

Best regards,
Salma El-Sayed

Matthias van de Meent

boekewurm+postgres@gmail.com

about 1 month ago

In reply to: Salma El-Sayed (#1)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

On Fri, 8 May 2026 at 16:40, Salma El-Sayed <salmasayed182003@gmail.com> wrote:

Hello Hackers,

My name is Salma El-Sayed, and I am working on B-tree Index Bloat Reduction as part of GSoC 2026. This is my introduction email[1]. I would like to share my current approach for page merging and get your feedback.

*The Core Idea*
Two sibling B-tree pages L (left) and R (right) are candidates for merging when both pages are below a size threshold (less than 10% full for example) and their combined size fits within the fill factor.

The merge copies all data from L into R, then:
1- R is marked BTP_MERGED --> it now contains L's data.
2- L is marked BTP_MERGED_AWAY --> it is logically gone but kept around temporarily for any active scan that still holds a reference to it.
3- L is unlinked from its parent. VACUUM will handle updating the right-link of L's left sibling as part of normal cleanup.

The merging of pages is the simple part. Making scans not skip or
return duplicated tuples is the hard part.

I think it's useful and important to have an up-to-date design with
all such considerations included, so that reviewers also are able to
have the whole picture in mind when providing guidance.

*Questions for the Community*

1- Triggering the merge: how aggressive should we be?

How should the merge process be triggered? I see two broad options:
a- Full-index scan: aggressively find all merge candidates in one pass.
b- Bounded scan: a separate dedicated scan that stops after a controlled number of pages, giving the user control over how much of the index to process at a time.

I suggest to keep track of candidate pages (i.e., leaf pages with more
than X% free space) in the btree vacuum scans. Whenever you hit a page
which are a merge candidate (or, would become a merge candidate), you
check if any of its sibling pages are already present as merge
candidate in a list of candidate merge pages. If one is, merge the
pages. If not; then add the current page not; and check if siblings of
the current page are in our candidate list. If so, try to merge with
the sibling; if they're not (or the merge failed), then you add the
current page to that candidate list.

Presumably, the amount of active candidates would remain small enough
to fit in a reasonable amount of memory; as merged candidates can be
removed once they've been used.

A possible interface for the candidate scanner might look something like:
bt_merge_pages(index regclass, min_fillfactor float8, max_fillfactor float8, max_pages int4)

I don't think a separate scan method for this makes much sense.

2- Flag naming and free bits

I am proposing two new flags in btpo_flags:
- BTP_MERGED: set on R to indicate it absorbed L's data.
- BTP_MERGED_AWAY: set on L to indicate it has been merged away.

** Are there free bits available in btpo_flags suitable for this purpose? And does the community see any objection to consuming two bits in this field for the merge mechanism?

btpo_flags has 16 bits, of which 7 don't yet have a meaning.

You might even be able to combine one new bit with BTP_HALF_DEAD, to
limit the need for new bits to just one: presumably, pages cannot be
both HALF_DEAD and MERGED/MERGED_AWAY at the same time.

Kind regards,

Matthias van de Meent

PS.
Some food for thought and questions in return:

1.) Page space usage
The idea that you shared in [0]/messages/by-id/CANBEAPFW64QUGmMYoPekV6jGrn5Yx8kTNfeEtcTG9CCBUUBoLQ@mail.gmail.com described a page-level "Merge ID".
Adding this to the page would affect the amount of space available on
btree pages (for at least merged btree pages), and that seems to imply
a possibly backward-incompatible change to currently indexed tables
(less space on a page available -> smaller max tuple size -> current
indexed data may not be reindexable with the new checks in place).

Have you thought about this, and/or figured out whether the decrease
in page space is relevant or can safely be worked around?

2.) changes to page lifecycle operations
There are various concurrent operations that may impact scan
consistency. You've covered how to recover from merges with documents
linked to from your earlier mail, but it's not clear to me if these
things work without causing issues w.r.t. scan output:

a. Deleting a BTP_MERGED page
When vacuum runs a second time all items on the target page may
now be removable; can that page be deleted safely?

b. Deleting a BTP_MERGED_AWAY page
BTP_MERGED_AWAY pages keep a copy of their old tuples around,
which you mention are used by backward scans. This means its contents
must also be cleaned up by subsequent VACUUM runs, as the backwards
IOS may otherwise return TIDs that have been recycled and recieved new
indexed values. This cleanup can result in an empty page - which can
happen earlier than the XID horizon in the MergeID. Does this design
allow those pages to be reclaimed?

c. A BTP_MERGED page must be split
What happens? And what happens if it keeps splitting, before any
further vacuum (maintenance) operations can affect the merged
contents?

d. Are BTP_MERGED_AWAY pages still part of the data structure?
So, is L still pointed to by both L's left sibling and R, or is it
immediately removed from the structure (or at least as immediate as a
HALF_DEAD page would)?
If it's kept in the structure for extended periods: Why?

e. When/how does VACUUM determine to recycle the BTP_MERGED_AWAY
page (mark it BTP_HALF_DEAD or BTP_DELETED)?
The docs [1]https://docs.google.com/document/d/1PIMG0N__4BIB0uDWOfcK-AruatNL8SCTlfTlBTCaoCo/edit?tab=t.0 linked from [0]/messages/by-id/CANBEAPFW64QUGmMYoPekV6jGrn5Yx8kTNfeEtcTG9CCBUUBoLQ@mail.gmail.com indicate MergeID is present only on
BTP_MERGED pages, so it can't be used to locally determine to start
the recycling process for a BTP_MERGED_AWAY page. What process is used
to clean up this index page?

3.) General performance
I'm worried about the cost of maintaining the scan boundary;
especially for index-only scans. Though it's likely you'll only have
to keep track of this at page boundaries, it's still a new and
possibly significant overhead; index key boundaries are not always as
small as int4/int8 + TID.

4.) Workload
Do you have an example workload that you expect to show improvements?

5.) Inner pages
Are inner pages ever considered for merging, or are you only
targeting leaf pages?

[0]: /messages/by-id/CANBEAPFW64QUGmMYoPekV6jGrn5Yx8kTNfeEtcTG9CCBUUBoLQ@mail.gmail.com
[1]: https://docs.google.com/document/d/1PIMG0N__4BIB0uDWOfcK-AruatNL8SCTlfTlBTCaoCo/edit?tab=t.0

Salma El-Sayed

salmasayed182003@gmail.com

12 days ago

In reply to: Matthias van de Meent (#2)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

Hi Matthias,

Thanks for your email and these detailed answers and questions.
Apologies for the delayed response, I am right in the middle of my
university final exams right now.

I am actively using your feedback to shape the design plan, but I
wanted to go ahead and address a couple of your specific questions
regarding BTP_MERGED_AWAY pages:

b. Deleting a BTP_MERGED_AWAY page
BTP_MERGED_AWAY pages keep a copy of their old tuples around,
which you mention are used by backward scans. This means its contents
must also be cleaned up by subsequent VACUUM runs, as the backwards
IOS may otherwise return TIDs that have been recycled and recieved new
indexed values. This cleanup can result in an empty page - which can
happen earlier than the XID horizon in the MergeID. Does this design
allow those pages to be reclaimed?

VACUUM will be taught to ignore the contents of BTP_MERGED_AWAY pages.
The entries inside are not live data, they are ghost copies cached for
exactly one case:
a backward scan that was positioned between R and L at the moment of the merge.
No other reader ever sees them. Forward scans see BTP_MERGED_AWAY and
skip via the right link.
New backward scans already read R (which now holds L's data) before
arriving at L, so they skip L too.
TID safety is guaranteed by MergeXID. The only scanner that can reach
L's "ghost copies" is one whose snapshot predates MergeXID. MVCC
guarantees that the heap rows those TIDs point to cannot be recycled
while any transaction predating MergeXID is still active, because
those rows are still visible to that transaction. Once MergeXID is no
longer visible to any active transaction, the page transitions to
HALF_DEAD normally. So VACUUM never needs to touch L's entries. The
page header does all the work.

d. Are BTP_MERGED_AWAY pages still part of the data structure?
So, is L still pointed to by both L's left sibling and R, or is it
immediately removed from the structure (or at least as immediate as a
HALF_DEAD page would)?
If it's kept in the structure for extended periods: Why?

L is unlinked from the parent as soon as it becomes BTP_MERGED_AWAY
and its key space will be assigned to R.
However, it does remain part of the leaf-level data structure (still
pointed to by L's left sibling and R).
This is necessary because backward scans that were positioned between
L and R during the merge still need to traverse left into L to read
the original data.
As soon as it is safe for L to become HALF_DEAD (the MergeXID horizon
is passed), it will be treated as a normal page deletion.

Best regards,
Salma El-Sayed

Robert Haas

robertmhaas@gmail.com

12 days ago

In reply to: Salma El-Sayed (#1)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

On Fri, May 8, 2026 at 10:40 AM Salma El-Sayed
<salmasayed182003@gmail.com> wrote:

A forward scanner that arrives at L after the merge sees BTP_MERGED_AWAY and follows through to R.
A backward scanner that arrives at R after the merge sees BTP_MERGED, reads R (which now contains L's data), and skips L entirely.

This seems OK for the first merge, but I think you need to be a lot
more explicit about what's going to happen after that. For instance,
what if you want to perform a merge on a page that is already marked
BTP_MERGED?

Or, for example, what happens if more splits happen after the merge?
Like if we have page A and then page B, we might mark A
BTP_MERGED_AWAY and B BTP_MERGED. Now suppose at the time this
happens, a scan is pointing to page A. Before the scan advances to
page B, that page gets split, so now we have: A(BTP_MERGED_AWAY) B0
(???) B1 (???). The problem is that we've already read page A, so we
need to use the logic that skips over tuples we may have already read
on both B0 and B1, both of which contain some of the tuples from B,
which now includes everything that we already read from A. So
presumably to make that work, we need to mark both B0 and B1 as
BTP_MERGED.

But if we do that, then there's no longer a 1:1 relationship between
BTP_MERGED_AWAY pages and BTP_MERGED pages. When we come to a
BTP_MERGED page, we don't know if it corresponds to some
BTP_MERGED_AWAY page we previously encountered, or some other
BTP_MERGED_AWAY page from long ago.

I'm not certain, but I am suspicious that using flag bits for this is
not going to work out. Maybe a flag bit is OK for the page that is
going away, because then it eventually transitions to half-dead like
you said. But for the surviving page, if that's just indicated by it
being marked BTP_MERGED, then eventually we can just end up with tons
of BTP_MERGED pages in the heap and there's nothing to unset those
bits. That's probably going to break something; if it doesn't, then it
seems unclear that BTP_MERGED needs to exists in the first place. I
feel like we might need to mark the surviving page using some kind of
indicator that "times out," like an XID or something, so that we don't
have to go back and clear BTP_MERGED flags later. But I don't really
know.

How should the merge process be triggered?

This seems really tricky. I think if the user has to manually run a
"try to merge pages" command or function, this functionality won't get
used very much. Ideally it would happen either automatically during
foreground operation, or as part of VACUUM. But that seems complicated
to make work, because there's a risk of merging pages too
aggressively, which could not only waste work but result in them being
split again soon afterward.

--
Robert Haas
EDB: http://www.enterprisedb.com

Kirk Wolak

wolakk@gmail.com

12 days ago

In reply to: Robert Haas (#4)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

On Thu, Jun 11, 2026 at 3:19 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, May 8, 2026 at 10:40 AM Salma El-Sayed
...

How should the merge process be triggered?

This seems really tricky. I think if the user has to manually run a
"try to merge pages" command or function, this functionality won't get
used very much. Ideally it would happen either automatically during
foreground operation, or as part of VACUUM. But that seems complicated
to make work, because there's a risk of merging pages too
aggressively, which could not only waste work but result in them being
split again soon afterward.

Totally agree, Robert!

Our goal on this project is:
1) Provably Solve the problem with ONLY a function (on demand)
2) Teaching AutoVacuum to reset BTP_MERGED* back to normal, after XID as
you suggested.
3) Experiment with various levels of how much work each function call
performs (we are being very conservative)

But the final progress might be:
A) A core function for people to run/use (fine tune control over how
aggressive)
B) Integrated into AutoVacuum (when it sees/makes mostly empty pages, it
spits out a request to merge them, out of process)
C) Integrated into AutoVacuum (it actually applies the merge). This is
currently a step way too far away.

But that is likely going to be over full versions of PG (probably years,
not months).

Kirk

Matthias van de Meent

boekewurm+postgres@gmail.com

12 days ago

In reply to: Salma El-Sayed (#3)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

On Thu, 11 Jun 2026 at 19:25, Salma El-Sayed <salmasayed182003@gmail.com> wrote:

Hi Matthias,

Thanks for your email and these detailed answers and questions.
Apologies for the delayed response, I am right in the middle of my
university final exams right now.

I am actively using your feedback to shape the design plan, but I
wanted to go ahead and address a couple of your specific questions
regarding BTP_MERGED_AWAY pages:

b. Deleting a BTP_MERGED_AWAY page
BTP_MERGED_AWAY pages keep a copy of their old tuples around,
which you mention are used by backward scans. This means its contents
must also be cleaned up by subsequent VACUUM runs, as the backwards
IOS may otherwise return TIDs that have been recycled and recieved new
indexed values. This cleanup can result in an empty page - which can
happen earlier than the XID horizon in the MergeID. Does this design
allow those pages to be reclaimed?

VACUUM will be taught to ignore the contents of BTP_MERGED_AWAY pages.
The entries inside are not live data, they are ghost copies cached for
exactly one case:
a backward scan that was positioned between R and L at the moment of the merge.
No other reader ever sees them. Forward scans see BTP_MERGED_AWAY and
skip via the right link.
New backward scans already read R (which now holds L's data) before
arriving at L, so they skip L too.
TID safety is guaranteed by MergeXID. The only scanner that can reach
L's "ghost copies" is one whose snapshot predates MergeXID. MVCC
guarantees that the heap rows those TIDs point to cannot be recycled
while any transaction predating MergeXID is still active, because
those rows are still visible to that transaction.

I don't think that's accurate. The index is not guaranteed to just
contain live or recently-dead rows: it is almost certainly going to
contain references to rows which are already dead to every session.
This is what VACUUM removes when it goes through index cleanup with
its calls into index_bulk_delete.

It is important to also note that it's not guaranteed that all current
dead rows are being cleaned up by a current index_bulk_delete() call -
if insufficient maintenance_work_mem was allocated, then only a subset
of currently dead rows will be included in the current
index_bulk_delete() call, and it's likely that cleanup will happen
again very soon.

So, if you merge a page that contains references to dead rows (which
you can't necessarily know when merging), the BTP_MERGED_AWAY page
might therefore contain references to rows which are (soon) ALL_DEAD
and would be cleaned up in the next VACUUM scan. So, if the page is
BTP_MERGED_AWAY with dead tuples that have been removed everywhere
else but whose tuples aren't cleaned up by vacuum by principle, then
that's an issue for scanners that took very long to continue on to
that page.

Once MergeXID is no
longer visible to any active transaction, the page transitions to
HALF_DEAD normally. So VACUUM never needs to touch L's entries. The
page header does all the work.

If a scan can return the items on a page (including BTP_MERGED_AWAY),
then vacuum *must* be able to clean up those items. If the index's
vacuum process doesn't clean up those items when requested, then the
index breaks the API contract of never returning TIDs that were
supposed to be removed in a completed ambulkdelete() call. In that
case, an index-only scan could encounter TIDs in the index that have
since been marked UNUSED by vacuum and gotten their page marked
ALL_VISIBLE; resulting in the dead entries being returned to the user.

d. Are BTP_MERGED_AWAY pages still part of the data structure?
So, is L still pointed to by both L's left sibling and R, or is it
immediately removed from the structure (or at least as immediate as a
HALF_DEAD page would)?
If it's kept in the structure for extended periods: Why?

L is unlinked from the parent as soon as it becomes BTP_MERGED_AWAY
and its key space will be assigned to R.
However, it does remain part of the leaf-level data structure (still
pointed to by L's left sibling and R).
This is necessary because backward scans that were positioned between
L and R during the merge still need to traverse left into L to read
the original data.
As soon as it is safe for L to become HALF_DEAD (the MergeXID horizon
is passed), it will be treated as a normal page deletion.

Could you expand on this "treated as" a bit more? Do you mean that
once the horizon has passed, the next time maintenance comes around
this page will be deleted like a normal empty page would during
vacuum? Or is it immediately considered dead?

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

Kirk Wolak

wolakk@gmail.com

about 13 hours ago

In reply to: Matthias van de Meent (#6)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

On Fri, Jun 12, 2026 at 11:07 AM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:

On Thu, 11 Jun 2026 at 19:25, Salma El-Sayed <salmasayed182003@gmail.com>
wrote:
...

VACUUM will be taught to ignore the contents of BTP_MERGED_AWAY pages.
The entries inside are not live data, they are ghost copies cached for
exactly one case:
a backward scan that was positioned between R and L at the moment of the

merge.

No other reader ever sees them. Forward scans see BTP_MERGED_AWAY and
skip via the right link.
New backward scans already read R (which now holds L's data) before
arriving at L, so they skip L too.
TID safety is guaranteed by MergeXID. The only scanner that can reach
L's "ghost copies" is one whose snapshot predates MergeXID. MVCC
guarantees that the heap rows those TIDs point to cannot be recycled
while any transaction predating MergeXID is still active, because
those rows are still visible to that transaction.

I don't think that's accurate. The index is not guaranteed to just
contain live or recently-dead rows: it is almost certainly going to
contain references to rows which are already dead to every session.
This is what VACUUM removes when it goes through index cleanup with
its calls into index_bulk_delete.

It is important to also note that it's not guaranteed that all current
dead rows are being cleaned up by a current index_bulk_delete() call -
if insufficient maintenance_work_mem was allocated, then only a subset
of currently dead rows will be included in the current
index_bulk_delete() call, and it's likely that cleanup will happen
again very soon.

So, if you merge a page that contains references to dead rows (which
you can't necessarily know when merging), the BTP_MERGED_AWAY page
might therefore contain references to rows which are (soon) ALL_DEAD
and would be cleaned up in the next VACUUM scan. So, if the page is
BTP_MERGED_AWAY with dead tuples that have been removed everywhere
else but whose tuples aren't cleaned up by vacuum by principle, then
that's an issue for scanners that took very long to continue on to
that page.

Allow me to step in on this part of the conversation. BTP_MERGED_AWAY is
mostly a FLAG page.
That is really like a pre- BTP_HALF_DEAD page. NOBODY should be interested
in the contents.
Especially Vacuum. No NORMAL scans should read those contents.

We are leaving the contents on this "L" page (of L,R, and L+R notation),
for the 2 VERY specific Edge cases.
1) FWD Scans that already read "L", and finds the "L+R" instead of R. (it
was caught in the middle due to locks, etc).
It "knows" there are ghost copies left behind on "L" (BTP_MERGED_AWAY), and
it goes back, reads them,
and filters them out of the L+R results it just processed so the L records
are NOT duplicated.
2) BWD Scans that read "R" before the merge, then finds "L"
(BTP_MERGED_AWAY). It knows it did NOT
read the L+R page. So, like the previous logic, it can safely read the "L"
records.

outside of those two cases, which are clearly edge cases. NOBODY should
see those "ghost" records.
We are using them just to make the algorithm clean and efficient without
having to add something elsewhere
to handle the fixups.

Back to the question of VACUUM. And where your insight might be trying to
help us from making a terrible mistake.
My understanding is that there can be MANY "extra" references in any index
to records that have been LONG deleted.
And when the Table is read, those are removed from the result set (I
believe it's why index-only scans are limited to
mostly clean tables, freshly vacuumed, etc).

So, our assumption is that, if we envision a RACE condition, where VACUUM
is trying to remove the INDEX references.
BUT our scan is already past that BLOCK. Our scan does not restart, know
or care that vacuum wants to remove them
from the index (there is no "fixup" process required for a scan). And
that's our argument for keeping VACUUM ignorant
of anything on BTP_MERGED_AWAY. Because NO "normal" scan will read those
contents. Only one that was likely
blocked by our activity, and needs to "fix itself" (to avoid dupes or
skips). Any scan that runs after our lock doesn't care.
If they ran before our lock, and finished, they don't care. The 2 edge
cases care... Because it causes Skips (FWD scan),
and Dupes (BWD scan), otherwise.

We even avoid using BTP_HALF_DEAD in our case, to "keep this flag" for the
edge cases.
Technically if we could KNOW there are no more edge cases alive, we could
have VACUUM always force this
BTP_MERGED_AWAY into BTP_HALF_DEAD status. That was our original thought
process. But you see that
we cannot know we have an edge case... Until... No queries could possibly
see past our transaction marker.

Now, it's at this point that I wonder if we can actually just DELETE the
BTP_MERGED_AWAY page, by VACUUM.
But what we prefer, is that it gets marked half dead, and the
TXNID_Block_No is used to flip the BTP_MERGED page(s)
that follow (because there could have been page splits). And all of those
pages that are past that horizon, WILL BE
flipped back to NORMAL pages (from BTP_MERGED).

For now, for testing and validation... We want to get it working first,
and have it take a few steps so we
can watch the table "self-recover" from the merge with the Autovacuum.

This is why you will hear Salma defend vacuum "not looking" at the old "L"
records, certainly not changing them.
They are REQUIRED to be the snapshot for what was merged... Only for the 2
edge cases.

Thanks for the comments. And for the record, I am one of the mentors on
this GSoC for Salma.
(Today's lesson for her, is how to defend your position, while being
prepared for the public beat down (lol) where you might be wrong. Be wrong
out loud! You learn faster!)

Kirk

Salma El-Sayed

salmasayed182003@gmail.com

about 13 hours ago

In reply to: Matthias van de Meent (#6)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

Hi Matthias,

Thanks for the response and detailed explanation.

Could you expand on this "treated as" a bit more? Do you mean that
once the horizon has passed, the next time maintenance comes around
this page will be deleted like a normal empty page would during
vacuum? Or is it immediately considered dead?

Regarding your question about how the page is "treated as" a normal deletion:

Because a BTP_MERGED_AWAY page is already unlinked from its parent, it
has essentially already completed the first stage of deletion. Once
the MergeXID horizon has safely passed, the page transitions to
HALF_DEAD and is handled exactly as described in the nbtree README for
the second stage of deletion:
"In the second-stage, the half-dead leaf page is unlinked from its
siblings. We first lock the left sibling (if any) of the target, the
target page itself, and its right sibling (there must be one) in that
order. Then we update the side-links in the siblings, and mark the
target page deleted."

To safely track this state transition, I need to store the MergeXID
and the blkno of the BTP_MERGED_AWAY page. As you pointed out
previously, adding these to the B-tree page header reduces available
space and risks backward incompatibility with max tuple sizes.
Given the constraints you mentioned, is modifying the header
completely off the table, or could we safely introduce this through a
new index version?

Also, I wanted to share my current implementation for forward scans
that were positioned between L and R before the merge. Since the
forward scan already read L, here is how I handle it:
When the scan encounters the BTP_MERGED page (R), it calls
_bt_readpage. After unlocking R, but before returning, it steps back
to read L (BTP_MERGED_AWAY). It saves L's tuples in a list inside
BTScanOpaqueData, compares them against the data just read from R
(so->currPos.items), and removes any duplicates.

Best regards,
Salma El-Sayed

Peter Geoghegan

pg@bowt.ie

about 13 hours ago

In reply to: Kirk Wolak (#7)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

On Tue, Jun 23, 2026 at 11:57 AM Kirk Wolak <wolakk@gmail.com> wrote:

Back to the question of VACUUM. And where your insight might be trying to help us from making a terrible mistake.
My understanding is that there can be MANY "extra" references in any index to records that have been LONG deleted.
And when the Table is read, those are removed from the result set (I believe it's why index-only scans are limited to
mostly clean tables, freshly vacuumed, etc).

It's not okay for VACUUM to ever fail to remove any TIDs from its
deadItems[] from the target index. If it fails to do that, the index
is already irretrievably broken; you cannot account for it at the scan
level.

Matthias explained why this is, but it boils down to this: that is a
core invariant, that all index AMs (with the possible exception of
BRIN) must follow at all times. It has nothing to do with what
concurrent scans might be doing when VACUUM runs; it's about
maintaining basic agreement between the index structure and the heap.
You can never allow the heap to recycle the same TID for a wholly
unrelated row/HOT chain, because doing so will result in the same TIDs
being reused for different logical rows some time after VACUUM runs.

Now, it's at this point that I wonder if we can actually just DELETE the BTP_MERGED_AWAY page, by VACUUM.
But what we prefer, is that it gets marked half dead, and the TXNID_Block_No is used to flip the BTP_MERGED page(s)
that follow (because there could have been page splits). And all of those pages that are past that horizon, WILL BE
flipped back to NORMAL pages (from BTP_MERGED).

For now, for testing and validation... We want to get it working first, and have it take a few steps so we
can watch the table "self-recover" from the merge with the Autovacuum.

This is why you will hear Salma defend vacuum "not looking" at the old "L" records, certainly not changing them.
They are REQUIRED to be the snapshot for what was merged... Only for the 2 edge cases.

Any design that lacks a completely airtight argument for why VACUUM
will reliably remove all of the TIDs from its deadItems[] in the
target index has zero chance of being accepted. Again, whether and how
that affects concurrent scans is irrelevant. The downstream problems
only start after VACUUM runs, when the heap starts to recycle TID
addresses that were supposed to be safe to recycle (safe because there
couldn't possibly be any references to the same TID left behind in
indexes).

--
Peter Geoghegan

#10

Matthias van de Meent

boekewurm+postgres@gmail.com

about 11 hours ago

In reply to: Kirk Wolak (#7)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

On Tue, 23 Jun 2026 at 17:57, Kirk Wolak <wolakk@gmail.com> wrote:

On Fri, Jun 12, 2026 at 11:07 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

On Thu, 11 Jun 2026 at 19:25, Salma El-Sayed <salmasayed182003@gmail.com> wrote:
...

VACUUM will be taught to ignore the contents of BTP_MERGED_AWAY pages.
The entries inside are not live data, they are ghost copies cached for
exactly one case:
a backward scan that was positioned between R and L at the moment of the merge.
No other reader ever sees them. Forward scans see BTP_MERGED_AWAY and
skip via the right link.
New backward scans already read R (which now holds L's data) before
arriving at L, so they skip L too.
TID safety is guaranteed by MergeXID. The only scanner that can reach
L's "ghost copies" is one whose snapshot predates MergeXID. MVCC
guarantees that the heap rows those TIDs point to cannot be recycled
while any transaction predating MergeXID is still active, because
those rows are still visible to that transaction.

I don't think that's accurate. The index is not guaranteed to just
contain live or recently-dead rows: it is almost certainly going to
contain references to rows which are already dead to every session.
This is what VACUUM removes when it goes through index cleanup with
its calls into index_bulk_delete.

It is important to also note that it's not guaranteed that all current
dead rows are being cleaned up by a current index_bulk_delete() call -
if insufficient maintenance_work_mem was allocated, then only a subset
of currently dead rows will be included in the current
index_bulk_delete() call, and it's likely that cleanup will happen
again very soon.

So, if you merge a page that contains references to dead rows (which
you can't necessarily know when merging), the BTP_MERGED_AWAY page
might therefore contain references to rows which are (soon) ALL_DEAD
and would be cleaned up in the next VACUUM scan. So, if the page is
BTP_MERGED_AWAY with dead tuples that have been removed everywhere
else but whose tuples aren't cleaned up by vacuum by principle, then
that's an issue for scanners that took very long to continue on to
that page.

Allow me to step in on this part of the conversation. BTP_MERGED_AWAY is mostly a FLAG page.
That is really like a pre- BTP_HALF_DEAD page. NOBODY should be interested in the contents.
Especially Vacuum. No NORMAL scans should read those contents.

We are leaving the contents on this "L" page (of L,R, and L+R notation), for the 2 VERY specific Edge cases.
1) FWD Scans that already read "L", and finds the "L+R" instead of R. (it was caught in the middle due to locks, etc).
It "knows" there are ghost copies left behind on "L" (BTP_MERGED_AWAY), and it goes back, reads them,
and filters them out of the L+R results it just processed so the L records are NOT duplicated.
2) BWD Scans that read "R" before the merge, then finds "L" (BTP_MERGED_AWAY). It knows it did NOT
read the L+R page. So, like the previous logic, it can safely read the "L" records.

outside of those two cases, which are clearly edge cases. NOBODY should see those "ghost" records.

My point is this: It is *exactly* a problem if BWD scans read L's data
if L's data is ignored by vacuum.

Consider this situation: L contains TIDs 1, 2, 3; R contains TIDs 4, 5, 6.

1. Session 1: Backward Index-Only scan finishes processing page R
2. Session 2: Deduplicates L + R, into R. L is marked BTP_MERGED_AWAY,
and contains remnants for (1, 2, 3). R now contains (1, 2, 3, 4, 5,
6), and is marked BTP_MERGED
3. Session 3: Vacuums the backing table, finds that TIDs 2, 3, 4 are
dead, and starts cleaning up indexes.
4. Session 3: As part of vacuum, cleans up page R, which now contains
only (1, 5, 6). L is untouched; contains (1, 2, 3).
5. Session 3: After finishing index cleanup; the LP_DEAD items are
removed from the heap, and all pages are marked all-visible;
6. Session 1: Continues the index-only scan. Finds page L, and returns
the remnants on that page L (1, 2, 3) to the IOS.
7. Session 1: IOS internally checks the visibility of the returned
tuples, finds the pages of those tids to be all-visible (see (5)), and
returns the TIDs and values associated with TIDs 1, 2, and 3, to the
higher scan node.

Note that in step (3), TIDs 2 and 3 are dead, and in step 5 they were
completely removed -- there shouldn't be any index scans that still
return those tuples (*). For all that the table knows, TIDs (2, 3, 4)
don't exist anymore. In step 7, however, they're suddenly visible,
because they weren't cleaned up from L by VACUUM.

To fix this; VACUUM *must* process page L, and remove the references
to TIDs 2 and 3, or otherwise mark those TIDs as "not to be returned
by scans", so that in step 6 the scan doesn't return TIDs 2 and 3 to
the IOS layer above.

(*) For Index-Only scans the original statement is accurate, but
normal Index scans (not IOS) will be able to return tuples after that
cleanup happened. However, instead of using the visibilitymap, those
scans will check the visibility of the rows on the heap page directly,
and will ignore (skip) invisible/missing line pointers.

We are using them just to make the algorithm clean and efficient without having to add something elsewhere
to handle the fixups.

Back to the question of VACUUM. And where your insight might be trying to help us from making a terrible mistake.
My understanding is that there can be MANY "extra" references in any index to records that have been LONG deleted.
And when the Table is read, those are removed from the result set (I believe it's why index-only scans are limited to
mostly clean tables, freshly vacuumed, etc).

Strictly speaking that's right, but indexes like btree, gist, spgist,
and hash, all have a guarantee that any tuple only has a single
location in the index at any point in time.
Scans may see the same TID pointing to a row (or the same TID pointing
to different versions of a row, or TIDs pointing to different versions
of different rows in case of tid reclamation through vacuum) several
times, but only one of those pointed-to rows can be visible to the
scan. The reason why it can see the same tid value several times is
because it takes time to traverse the index, and because the scan
doesn't have a snapshot view of the whole index. A scan only looks at
parts of the index at once, and the same tid value may be present at
different locations at different times, but only ever at one place in
the index at a time.

So, our assumption is that, if we envision a RACE condition, where VACUUM is trying to remove the INDEX references.
BUT our scan is already past that BLOCK. Our scan does not restart, know or care that vacuum wants to remove them
from the index (there is no "fixup" process required for a scan). And that's our argument for keeping VACUUM ignorant
of anything on BTP_MERGED_AWAY. Because NO "normal" scan will read those contents. Only one that was likely
blocked by our activity, and needs to "fix itself" (to avoid dupes or skips).

I think this argument is faulty; if only because scans (and queries
more generally) are not guaranteed to immediately run to completion
without stopping. The perfect example for that is an index scan that's
producing data for a cursor: Cursors may be paused for any amount of
time between any two amgettuple() calls.

Any scan that runs after our lock doesn't care.

Any scan that *starts* after the lock. Scans that started before the
lock but continue running after the lock *will* care.

And I'm not so sure about your claim that *any* scan doesn't care.
With scan boundary keys I think you may be right, but I'm yet to be
convinced that boundary keys provide sufficient performance to get
this into the tree.

If they ran before our lock, and finished, they don't care. The 2 edge cases care... Because it causes Skips (FWD scan),
and Dupes (BWD scan), otherwise.

Indeed.

We even avoid using BTP_HALF_DEAD in our case, to "keep this flag" for the edge cases.
Technically if we could KNOW there are no more edge cases alive, we could have VACUUM always force this
BTP_MERGED_AWAY into BTP_HALF_DEAD status. That was our original thought process. But you see that
we cannot know we have an edge case... Until... No queries could possibly see past our transaction marker.

Why do you need to read the TIDs from the MERGED_AWAY page? Can't you
keep just (and only just) the MERGED_AWAY status + its original key
range around, and use that to extract the moved tuples from the
_MERGED page (... and its successors in case of page splits of that
page)? It is probably a bit (lot) more effort, but it should avoid the
TID recycling issue mentioned above, in cases of backward scans.

For now, for testing and validation... We want to get it working first, and have it take a few steps so we
can watch the table "self-recover" from the merge with the Autovacuum.

What do you mean by "self-recover"? Recycle index pages and reduce in size?

This is why you will hear Salma defend vacuum "not looking" at the old "L" records, certainly not changing them.
They are REQUIRED to be the snapshot for what was merged... Only for the 2 edge cases.

I'm not convinced about that requirement being the right direction; we
can't afford to create a new known visibility bug just to fix known
scan edge cases for page merging.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

#11

Peter Geoghegan

pg@bowt.ie

about 11 hours ago

In reply to: Matthias van de Meent (#10)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

On Tue, Jun 23, 2026 at 1:53 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Tue, 23 Jun 2026 at 17:57, Kirk Wolak <wolakk@gmail.com> wrote:

Back to the question of VACUUM. And where your insight might be trying to help us from making a terrible mistake.
My understanding is that there can be MANY "extra" references in any index to records that have been LONG deleted.
And when the Table is read, those are removed from the result set (I believe it's why index-only scans are limited to
mostly clean tables, freshly vacuumed, etc).

Strictly speaking that's right, but indexes like btree, gist, spgist,
and hash, all have a guarantee that any tuple only has a single
location in the index at any point in time.

Minor clarification, for the benefit of others: the guarantee is that
the index cannot contain the same heap TID twice (which might point to
a hot chain with several versions).

Scans may see the same TID pointing to a row (or the same TID pointing
to different versions of a row, or TIDs pointing to different versions
of different rows in case of tid reclamation through vacuum) several
times, but only one of those pointed-to rows can be visible to the
scan.

I think that you meant that it can see different TIDs originating from
the same updated logical row.

A non-hot update can create 2 separate TIDs that point to different
versions of the same logical row. In that case, both TIDs must be
returned to the scan (assuming both have index tuple values that
satisfy the scan keys). This doesn't really matter to the index AM; it
doesn't know about updates at all.

What it boils down to is that the same TID must never exist in any 2
index tuples in the same index simultaneously. Currently, amcheck
cannot detect that condition because checking every index tuple
against every other one is inconvenient and expensive. However,
offering such a check would be useful; it wouldn't need to rely on
heap visibility information, as the simple fact that the same TID
appears a second time always indicates index corruption.

--
Peter Geoghegan

#12

Matthias van de Meent

boekewurm+postgres@gmail.com

about 11 hours ago

In reply to: Peter Geoghegan (#11)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

On Tue, 23 Jun 2026 at 20:16, Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Jun 23, 2026 at 1:53 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Tue, 23 Jun 2026 at 17:57, Kirk Wolak <wolakk@gmail.com> wrote:

Back to the question of VACUUM. And where your insight might be trying to help us from making a terrible mistake.
My understanding is that there can be MANY "extra" references in any index to records that have been LONG deleted.
And when the Table is read, those are removed from the result set (I believe it's why index-only scans are limited to
mostly clean tables, freshly vacuumed, etc).

Strictly speaking that's right, but indexes like btree, gist, spgist,
and hash, all have a guarantee that any tuple only has a single
location in the index at any point in time.

Minor clarification, for the benefit of others: the guarantee is that
the index cannot contain the same heap TID twice (which might point to
a hot chain with several versions).

Yes, I could have worded that better.

Scans may see the same TID pointing to a row (or the same TID pointing
to different versions of a row, or TIDs pointing to different versions
of different rows in case of tid reclamation through vacuum) several
times, but only one of those pointed-to rows can be visible to the
scan.

I think that you meant that it can see different TIDs originating from
the same updated logical row.

It could access and return the same TID twice, on multiple pages, if
the TID is recycled between the page accesses. But, as mentioned, at
most one of the rows indicated by the index' scan mechanism will be
MVCC-visible for the IndexScan executor node.

A non-hot update can create 2 separate TIDs that point to different
versions of the same logical row. In that case, both TIDs must be
returned to the scan (assuming both have index tuple values that
satisfy the scan keys). This doesn't really matter to the index AM; it
doesn't know about updates at all.

Except the indexUnchanged flag for aminsert() -which index AMs should
only use as a hint- but more generally, yes that's right.

What it boils down to is that the same TID must never exist in any 2
index tuples in the same index simultaneously.

Eventually, indeed, it boils down to this.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

#13

Peter Geoghegan

pg@bowt.ie

about 10 hours ago

In reply to: Matthias van de Meent (#12)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

On Tue, Jun 23, 2026 at 2:30 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

I think that you meant that it can see different TIDs originating from
the same updated logical row.

It could access and return the same TID twice, on multiple pages, if
the TID is recycled between the page accesses. But, as mentioned, at
most one of the rows indicated by the index' scan mechanism will be
MVCC-visible for the IndexScan executor node.

That might be true with SnapshotAny, but bringing visibility concerns
into this discussion doesn't seem useful.

The relevant invariant is that the same physical TID cannot appear
twice within the same index. It is useful to think of it as an
invariant that the index AM is directly concerned with (and to ignore
visibility stuff, which happens at a higher level, and shouldn't be of
concern to the index AM at all).

In general, nbtree page deletion (and merging underfull pages)
modifies a physical data structure to improve space efficiency. It
isn't relevant why VACUUM deleted some index tuples (making that free
space available and indirectly triggering deletion/merging).

Using XIDs for BTPageGetDeleteXid is just a convenient (though very
conservative) way to implement "the drain technique". It would still
be correct to implement it differently, provided this alternative
approach also ensures that no backend follows a downlink and ends up
on a wholly unrelated page due to concurrent deletion.

We must always ensure that such a backend at least lands on a page
marked deleted/a tombstone page and then recovers by moving right --
no scan can ever have an irredeemably bad picture of the tree
structure. But we don't fundamentally need to care about XIDs to make
that work -- this is a physical modification that's orthogonal to
logical/transaction considerations.

A non-hot update can create 2 separate TIDs that point to different
versions of the same logical row. In that case, both TIDs must be
returned to the scan (assuming both have index tuple values that
satisfy the scan keys). This doesn't really matter to the index AM; it
doesn't know about updates at all.

Except the indexUnchanged flag for aminsert() -which index AMs should
only use as a hint- but more generally, yes that's right.

That's just a hint used to trigger bottom-up index deletion, it really
isn't relevant.

--
Peter Geoghegan

#14

Kirk Wolak

wolakk@gmail.com

about 9 hours ago

In reply to: Kirk Wolak (#7)

Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

On Tue, Jun 23, 2026 at 11:57 AM Kirk Wolak <wolakk@gmail.com> wrote:

On Fri, Jun 12, 2026 at 11:07 AM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:

On Thu, 11 Jun 2026 at 19:25, Salma El-Sayed <salmasayed182003@gmail.com>
wrote:
...Thanks for the comments. And for the record, I am one of the mentors
on this GSoC for Salma.

(Today's lesson for her, is how to defend your position, while being

prepared for the public beat down (lol) where you might be wrong. Be
wrong out loud! You learn faster!)

Everyone. My apologies. Some people may not have detected the
sarcasm/humor in that last line.
Obviously this is NOT how this group works, and the moderators would not
tolerate it.

I stand by: "Be wrong out loud! You learn faster!"...

Because I certainly did!

Kirk