Randomize B-Tree page split location to avoid oscillating patterns

Started by Dmitry Dolgov16 days ago6 messageshackers
Jump to latest
#1Dmitry Dolgov
9erthalion6@gmail.com

TL;DR There seems to be a known phenomenon, where a data ingestion into a
B-Tree produces page splits following an oscillating pattern, which in turn
affects IO and buffer contention, impacting the performance. It turns out that
PostgreSQL is not an exception, but it should be possible to randomize a split
location a bit to mitigate the issue.

Hi,

Some time ago while working on models for PostgreSQL performance [1]https://zenodo.org/records/15786156 I've
stumbled upon an interesting oscillating patters around various B-Tree metrics
(number of page splits, index size, etc). This turned out to be a known thing
described in a catchy way as "Waves of misery" [2]Glombiewski N., Seeger B., Graefe G. (2019). Waves of Misery After Index Creation. BTW 2019. Gesellschaft für Informatik. doi:10.18420/btw2019-06, and boils down to the fact
that a fixed split location is usually chosen for a page split -- in this case
probabilities of page split lead to oscillating solutions under certain
workloads. Looks like the only pre-condition is that the data to be ingested
has the same distribution as the data already existing in the tree, in
particular UUIDs are in a bad position for this. And of course it's possible to
reproduce this with PostgreSQL.

Such an oscillation can lead to variability in IO and buffer contention,
negatively impacting performance. The fillfactor only shifts the waves, but do
not cancel them. One of the proposed remediation for this is to do suffix
truncation, which will "spread" split locations across some range. While we do
column suffix truncation, it turns out to be not enough for many workloads.
Another option is to randomize the split location, e.g. pick the actual
location from a range of 20% around the best one. In our case it's easy to do
randomization based on the split state, as all the possible split locations are
sorted by delta -- and all what's needed is to add a shift to the lowsplit from
a range, based on the number of split locations.

I've done few experiments with this, here is how it looks like:

* The first one is synthetic: a single column table with integer values, a
B-Tree index over it with the fillfactor=100, inserting new values one by one
from a uniform distribution via PGBench. In the graph "split.png" you can see
the number of page splits over time for the main branch and the patch (named
"Main" and "Rand" correspondingly), and the oscillating is clear for the
former one.

* Another one is following a data schema from one real projects out there, with
UUIDs as index values and very large records, with the default fillfactor=90.
The data ingestion is happening in large batches. The graph "split_batch.png"
represent the data for this case, with the main branch oscillating much more
than the randomized.

The unfortunate part is that I couldn't get clear numbers for the performance
impact. Turns out the disk in my experimental setup is not good enough to get a
sufficient number of inserts to trigger the issue, and to get nice graphs I was
running everything either on a RAM disk or on an unlogged table -- in both
cases it's easy to observe oscillations of page splits, but their impact is not
large enough since only so much IO is happening.

But anyway, any thoughts / commentaries on that?

[1]: https://zenodo.org/records/15786156
[2]: Glombiewski N., Seeger B., Graefe G. (2019). Waves of Misery After Index Creation. BTW 2019. Gesellschaft für Informatik. doi:10.18420/btw2019-06
Creation. BTW 2019. Gesellschaft für Informatik. doi:10.18420/btw2019-06

Attachments:

v1-0001-Add-a-USDT-for-tracing-nbtree-page-splits.patchtext/plain; charset=us-asciiDownload+5-1
v1-0002-Randomize-nbtree-split-location-to-avoid-oscillat.patchtext/plain; charset=iso-8859-1Download+20-2
split.pngimage/pngDownload+0-1
split_batch.pngimage/pngDownload+1-0
In reply to: Dmitry Dolgov (#1)
Re: Randomize B-Tree page split location to avoid oscillating patterns

On Mon, Apr 27, 2026 at 12:24 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Some time ago while working on models for PostgreSQL performance [1] I've
stumbled upon an interesting oscillating patters around various B-Tree metrics
(number of page splits, index size, etc). This turned out to be a known thing
described in a catchy way as "Waves of misery" [2], and boils down to the fact
that a fixed split location is usually chosen for a page split -- in this case
probabilities of page split lead to oscillating solutions under certain
workloads.

I'm quite familiar with the paper, and find its argument convincing.
I'm not sure of the overall importance of the issues that they
describe, but the phenomenon is obviously real.

Such an oscillation can lead to variability in IO and buffer contention,
negatively impacting performance. The fillfactor only shifts the waves, but do
not cancel them. One of the proposed remediation for this is to do suffix
truncation, which will "spread" split locations across some range. While we do
column suffix truncation, it turns out to be not enough for many workloads.
Another option is to randomize the split location, e.g. pick the actual
location from a range of 20% around the best one. In our case it's easy to do
randomization based on the split state, as all the possible split locations are
sorted by delta -- and all what's needed is to add a shift to the lowsplit from
a range, based on the number of split locations.

My interpretation is that randomization can be combined with the usual
suffix truncation split point choosing heuristics; the paper even
recommends this at one point. That is, you randomly pick from a small
number of candidate split points that are all approximately equally
good according to the traditional criteria. It looks like your patch
doesn't account for suffix truncation at all, though.

Your patch should initially look for several split points that are
approximately equal in quality according to the current criteria -- a
separate, initial pass. You'd then randomly pick a final split point
from those gathered during this initial pass -- not from the original
fillfactormult-sorted list of split points. What you have in v1 will
make suffix truncation much less effective, which seems unacceptable.

Another issue is that nbtsort.c doesn't have any ability to pick among
split points; it focuses entirely on keeping free space balanced. To
get much benefit from this, I think you'd have to teach nbtsort.c to
also pick split points using approximately the same algorithm as
nbtsplitloc.c would (assuming retail inserts in ascending key space
order). I've been meaning to add proper suffix truncation to the
CREATE INDEX case.

I've done few experiments with this, here is how it looks like:

* The first one is synthetic: a single column table with integer values, a
B-Tree index over it with the fillfactor=100, inserting new values one by one
from a uniform distribution via PGBench. In the graph "split.png" you can see
the number of page splits over time for the main branch and the patch (named
"Main" and "Rand" correspondingly), and the oscillating is clear for the
former one.

I don't think that fillfactor=100 is very interesting in general.

The unfortunate part is that I couldn't get clear numbers for the performance
impact. Turns out the disk in my experimental setup is not good enough to get a
sufficient number of inserts to trigger the issue, and to get nice graphs I was
running everything either on a RAM disk or on an unlogged table -- in both
cases it's easy to observe oscillations of page splits, but their impact is not
large enough since only so much IO is happening.

But anyway, any thoughts / commentaries on that?

I wouldn't expect the patch to increase absolute throughput
significantly, if at all. Its value comes from making the *rate* of
splits over time more consistent, for a given fixed workload. You
might notice a more interesting effect if you look at latency,
particularly worst case latency.

--
Peter Geoghegan

#3Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#2)
Re: Randomize B-Tree page split location to avoid oscillating patterns

Hi,

On 2026-04-27 14:07:14 -0400, Peter Geoghegan wrote:

On Mon, Apr 27, 2026 at 12:24 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

The unfortunate part is that I couldn't get clear numbers for the performance
impact. Turns out the disk in my experimental setup is not good enough to get a
sufficient number of inserts to trigger the issue, and to get nice graphs I was
running everything either on a RAM disk or on an unlogged table -- in both
cases it's easy to observe oscillations of page splits, but their impact is not
large enough since only so much IO is happening.

But anyway, any thoughts / commentaries on that?

I wouldn't expect the patch to increase absolute throughput
significantly, if at all. Its value comes from making the *rate* of
splits over time more consistent, for a given fixed workload. You
might notice a more interesting effect if you look at latency,
particularly worst case latency.

Possibly somewhat orthogonal, but the worst case latency point reminded me of
something around this:

It seems like it's not great for contention avoidance / concurrency that right
now btree splits will often do IO - to write out a victim buffer and then to
extend the relation - while holding an exclusive lock on a page.

Couldn't we instead release the lock on the page, acquire an empty page,
reacquire the lock, recheck that the split is still needed and, if so, split
the page without needing to do IO while holding the lock?

Greetings,

Andres Freund

#4Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Peter Geoghegan (#2)
Re: Randomize B-Tree page split location to avoid oscillating patterns

On Mon, Apr 27, 2026 at 02:07:14PM -0400, Peter Geoghegan wrote:

Sorry for the delay.

My interpretation is that randomization can be combined with the usual
suffix truncation split point choosing heuristics; the paper even
recommends this at one point. That is, you randomly pick from a small
number of candidate split points that are all approximately equally
good according to the traditional criteria. It looks like your patch
doesn't account for suffix truncation at all, though.

Your patch should initially look for several split points that are
approximately equal in quality according to the current criteria -- a
separate, initial pass. You'd then randomly pick a final split point
from those gathered during this initial pass -- not from the original
fillfactormult-sorted list of split points. What you have in v1 will
make suffix truncation much less effective, which seems unacceptable.

The paper suggest combining randomization and suffix truncation via
randomly shifting the interval for truncation. I'm not sure a separate
pass is needed for that, looks like it should be enough to add random
shift to lowsplit / highsplit in _bt_bestsplitloc, when the penalty is
calculated.

The interesting part here is that it seems the current split interval
might be too small for such randomization to make a significant impact
-- few experiments show that with the current value the result looks
more like changing the fillfactor, waves are just shifting to the right.
Increasing the split interval helps in this case.

Such approach (randomized suffix truncation interval) produces the same
effect in a test with single column index as in the patch above. In a
test with a wider index and significant impact of truncation, there is
no visible degradation (and no oscillations either, as expected suffix
truncation introduces some randomness on it's own).

Another issue is that nbtsort.c doesn't have any ability to pick among
split points; it focuses entirely on keeping free space balanced. To
get much benefit from this, I think you'd have to teach nbtsort.c to
also pick split points using approximately the same algorithm as
nbtsplitloc.c would (assuming retail inserts in ascending key space
order). I've been meaning to add proper suffix truncation to the
CREATE INDEX case.

Good point, I need to look into this.

The unfortunate part is that I couldn't get clear numbers for the performance
impact. Turns out the disk in my experimental setup is not good enough to get a
sufficient number of inserts to trigger the issue, and to get nice graphs I was
running everything either on a RAM disk or on an unlogged table -- in both
cases it's easy to observe oscillations of page splits, but their impact is not
large enough since only so much IO is happening.

But anyway, any thoughts / commentaries on that?

I wouldn't expect the patch to increase absolute throughput
significantly, if at all. Its value comes from making the *rate* of
splits over time more consistent, for a given fixed workload. You
might notice a more interesting effect if you look at latency,
particularly worst case latency.

That's exactly what I'm talking about. In those experiments I was trying
to get any visible changes in latency variability, but in this
particular setup I could spot nothing beyond regular noise, while the
number of page splits was obviously heavily oscillating.

In reply to: Dmitry Dolgov (#4)
Re: Randomize B-Tree page split location to avoid oscillating patterns

On Wed, May 6, 2026 at 1:10 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Apr 27, 2026 at 02:07:14PM -0400, Peter Geoghegan wrote:
Your patch should initially look for several split points that are
approximately equal in quality according to the current criteria -- a
separate, initial pass. You'd then randomly pick a final split point
from those gathered during this initial pass -- not from the original
fillfactormult-sorted list of split points. What you have in v1 will
make suffix truncation much less effective, which seems unacceptable.

The paper suggest combining randomization and suffix truncation via
randomly shifting the interval for truncation. I'm not sure a separate
pass is needed for that, looks like it should be enough to add random
shift to lowsplit / highsplit in _bt_bestsplitloc, when the penalty is
calculated.

I can't see why that would make much difference. Increasing the split
interval usually won't affect the final chosen split point at all.
Decreasing is more likely to change the split point, but that's
*precisely* because it makes suffix truncation less effective.

In general it's quite unlikely that an expanded interval (e.g.,
doubling LEAF_SPLIT_DISTANCE) will include a split point that
truncates additional index attributes compared to the best split point
available within the current calculated split interval/current
LEAF_SPLIT_DISTANCE. For example, with the TPC-C indexes, I'd expect
only a tiny minority of all page splits to be affected by doubling
LEAF_SPLIT_DISTANCE to make nbtsplitloc.c consider ~20% of all
possible split points around the middle of the page. Whereas *halving*
LEAF_SPLIT_DISTANCE will indeed make suffix truncation worse.

This asymmetry matters. I believe that it'll make it impossible for
space utilization to average out to the target leaffillfactor% over
time. After all, this approach doesn't actually target space
utilization. This is also why it will make literally no difference at
all with a single column unique index.

The interesting part here is that it seems the current split interval
might be too small for such randomization to make a significant impact
-- few experiments show that with the current value the result looks
more like changing the fillfactor, waves are just shifting to the right.
Increasing the split interval helps in this case.

The point of gathering a list of "equally good" split points in an
initial pass is that it removes the danger of making the split
interval less effective in terms of final split penalty. The random
choice becomes a choice among split points that all truncate away the
same number of suffix attributes (all of which must still be
sufficiently close to the space-optimal split location to ensure that
no split is ever wildly unbalanced). This prevents the variable/random
choice from affecting the suffix truncation/split penalty.

With a single column unique index, all split points will have an equal
penalty under this scheme (implying that suffix truncation will be
equally effective). The initial list gathered in the first pass is
exactly all of the split points within the split interval in that
case. Whereas with other types of indexes the initial list gathered is
some subset of all the split points within the interval, without
regard for how balanced the post-split space utilization will be (all
splits within the interval are assumed to be "good enough" from a
space utilization perspective). Over time, space utilization should
reach fillfactor% -- without it negatively affecting suffix truncation.

It's possible that increasing the split interval would also make
sense. That seems like a follow-up question to me.

--
Peter Geoghegan

#6Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Peter Geoghegan (#5)
Re: Randomize B-Tree page split location to avoid oscillating patterns

On Wed, May 06, 2026 at 01:49:10PM -0400, Peter Geoghegan wrote:
On Wed, May 6, 2026 at 1:10 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Apr 27, 2026 at 02:07:14PM -0400, Peter Geoghegan wrote:
Your patch should initially look for several split points that are
approximately equal in quality according to the current criteria -- a
separate, initial pass. You'd then randomly pick a final split point
from those gathered during this initial pass -- not from the original
fillfactormult-sorted list of split points. What you have in v1 will
make suffix truncation much less effective, which seems unacceptable.

The paper suggest combining randomization and suffix truncation via
randomly shifting the interval for truncation. I'm not sure a separate
pass is needed for that, looks like it should be enough to add random
shift to lowsplit / highsplit in _bt_bestsplitloc, when the penalty is
calculated.

I can't see why that would make much difference. Increasing the split
interval usually won't affect the final chosen split point at all.
Decreasing is more likely to change the split point, but that's
*precisely* because it makes suffix truncation less effective.

I assume by increasing/decreasing the split interval you mean increase/decrease
around the split point, but I was talking about shifting the interval to the
right. More about this in the next part.

The point of gathering a list of "equally good" split points in an
initial pass is that it removes the danger of making the split
interval less effective in terms of final split penalty. The random
choice becomes a choice among split points that all truncate away the
same number of suffix attributes (all of which must still be
sufficiently close to the space-optimal split location to ensure that
no split is ever wildly unbalanced). This prevents the variable/random
choice from affecting the suffix truncation/split penalty.

I see what you mean, but I'm concerned about the resulting overhead. Here is my
understanding of how it would work, let me know if something is missing in the
chain of thoughts:

* What happens currently is we collect all possible split points, sort them by
delta, calculate the split interval. Then we iterate over the possible split
points in the interval [0, split interval), trying to find the one causing
the lowest penalty. If what we found is the best possible penalty, we stop --
for unique indexes it means that virtually all the time everything is over
after a single iteration, since the first attribute is enough to
distinguish the split.

* If we're going to pick at random from a list of "equally good" split points,
everything up to the split interval is the same. Then we iterate over the
possible split points in the interval [0, split interval), and search for all
the points with penalty equal or lower than the current lowest value (the
list is reset if we found a lower value). We search this way until we found
enough for the chosen randomization interval (those 20% from the paper) or
until we're out of split points. In the end we've got a list of split points,
each has equally low penalty, and we choose a split point at random from this
list. For a unique index it means we have to do at least as many iterations
as the size of randomization interval, instead of one iteration as before --
this is the overhead I'm concerned about.

* If we're going to shift the split interval, everything is also the same up to
the split interval calculation. Now we introduce a

delta = random from (0, randomization interval)

and consider two intervals of available split points:

[0, delta) and [delta, split interval + delta)

We first search in [0, delta) as before for lowest possible penalty, just to
make sure we're not missing it, if it's not present in the second interval.
The main part is search in [delta, split interval + delta) for a split point
with the same or lower penalty as found in the first interval and use it as
the final result. There are few possible outcomes:

1. There are one or more lowest penalty split point in [0, delta) and one or
more in [delta, split interval + delta). Since delta is randomized, we pick
one of the "equally good" split points at random.

2. There are one or more lowest penalty split points in [0, delta) and for
some reason none in [delta, split interval + delta). In this case we forgo
randomization and go with the lowest penalty split point, using the point
from the [0, delta).

3. There are none lowest penalty split points in [0, delta) and one or more
in [delta, split interval + delta). In this case we still pick up one of the
"equally good" split points, still with some randomization.

4. There are none lowest penalty split points in [0, randomization interval)
and one or more in [randomization interval, split interval + randomization
interval). In this case we still pick up one of the "equally good" split
points, but without randomization, since the best point lays beyond the
allowed randomization range, and we hit it all the time.

Ultimately we would like to apply randomization of split point for mostly
unique indexes, since otherwise suffix truncation already does the job. With
this approach the randomization interval serves as a threshold to distinguish
such indexes -- if the index has mostly unique values, the optimal split
point would be located close to the start of the interval and multiple
"equally good" points will fall within the randomization interval. If on the
other hand the index has many duplicating values, the optimal split point may
be located outside the randomization interval and we would choose it all
the time -- but thanks to suffix truncation it doesn't matter.

From the overhead perspective, with this approach we do:
- one search in [0, delta), which will end after one iteration for unique
indexes.
- then do a "jump" of random distance and do one search in [delta, split
interval + delta) on the same conditions. It will end after one iteration
for unique index, and will process a similar number of iteration otherwise.

To summarize, in comparison with the second approach the third one seems to
have less overhead, the same guarantees regarding penalty, and more complex
implementation. I'm fine going with the second approach, but first would like
to discuss the alternatives and make sure we're on the same page.