Problems with get_actual_variable_range's VISITED_PAGES_LIMIT

Started by Peter Geoghegan, 25 days ago. 4 messages.

get_actual_variable_range() scans an index to find actual min/max
values for selectivity estimation. Since this happens during planning,
we can't afford to spend too much time on it. We use
VISITED_PAGES_LIMIT (100 heap pages) as a threshold -- once we've
visited that many heap pages, we give up and fall back to pg_statistic
values. See commit 9c6ad5eaa9 for background information.

Recent benchmark results from Mark Callaghan [1] show that
get_actual_variable_range isn't very effective once the problem with
dead index tuples really starts to get out of hand. This is expected
with queue-like tables that continually have older records deleted as
newer ones are inserted. Right now this is the Achilles' heel for
Postgres when it runs on Mark's insert benchmark. It's also likely to
be a problem for TPC-C's order table. Tables that look like that are
reasonably common in real world workloads IME.

(My work on this problem is also related to the index prefetching
patch, which makes API revisions that aren't really compatible with
the current VISITED_PAGES_LIMIT design. More on that later.)

Problem
=======

VISITED_PAGES_LIMIT counts heap page visits. But when there are many
index tuples marked LP_DEAD, btgettuple will traverse
indefinitely-many *index* pages before returning even a single tuple
to the selfuncs.c caller. The selfuncs.c heap page counter never gets
a chance to increment. Profiling by Mark shows that we waste quite a
lot of CPU cycles in nbtree functions like _bt_readpage when this
happens. In practice there is no *natural limit* on the amount of work
that we can do during any one call to get_actual_variable_range --
VISITED_PAGES_LIMIT doesn't consider costs related to reading index
leaf pages at all.
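
To make the failure mode concrete, here is a toy model (emphatically
not PostgreSQL code -- the function and page counts are invented for
illustration) of a scan whose leading leaf pages contain only LP_DEAD
tuples:

```python
VISITED_PAGES_LIMIT = 100  # heap pages, as in the real selfuncs.c code

def scan_for_min(dead_leaf_pages, limit=VISITED_PAGES_LIMIT):
    """Toy scan: the first dead_leaf_pages leaf pages hold only LP_DEAD
    tuples, which the index AM skips without returning anything to the
    caller; the page after that has a live tuple."""
    heap_pages_visited = 0
    index_pages_read = 0
    for page in range(dead_leaf_pages + 1):
        if heap_pages_visited >= limit:  # the only bail-out we have today
            return ("gave up", index_pages_read, heap_pages_visited)
        index_pages_read += 1
        if page < dead_leaf_pages:
            continue  # every tuple LP_DEAD: the heap counter never moves
        heap_pages_visited += 1  # first live tuple: a single heap fetch
        return ("found", index_pages_read, heap_pages_visited)

# 10,000 all-dead leaf pages: we read 10,001 index pages, but only ever
# visit 1 heap page, so VISITED_PAGES_LIMIT never has a chance to fire.
print(scan_for_min(10_000))  # -> ('found', 10001, 1)
```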

The design of get_actual_variable_range specifically aims to set
LP_DEAD bits, with the goal of helping future calls to
get_actual_variable_range avoid heap fetches. But there's an important
sense in which this hurts rather than helps: each LP_DEAD tuple is one
less that can be counted against VISITED_PAGES_LIMIT going forward.
The more LP_DEAD bits we set, the less effective VISITED_PAGES_LIMIT
becomes at bailing out early during future calls to
get_actual_variable_range. Inevitably, the cost of reading index leaf
pages eventually becomes the bottleneck -- at least past some tipping
point. That effect seems perverse -- shouldn't setting LP_DEAD bits in
index tuples be strictly a good thing?

I suspect that VISITED_PAGES_LIMIT is fundamentally addressing the
problem in the wrong way. Even without VISITED_PAGES_LIMIT, or
anything like it, it is reasonable to expect that
get_actual_variable_range will require a very small, fixed amount of
work to find an extremal value in the vast majority of cases. After
all, we only need to locate one non-dead extremal index tuple, and
then we're done. If we get anywhere near VISITED_PAGES_LIMIT, then
it's already reasonable to suppose that we're dealing with a
pathological workload -- a workload exactly like the one that Mark's
testing ran into.

ISTM that Mark has demonstrated that VISITED_PAGES_LIMIT doesn't work
particularly well in general. AFAICT there is nothing special about
Mark's complaint -- it's not some special case that the
VISITED_PAGES_LIMIT design didn't directly consider. Any case where
get_actual_variable_range does many heap fetches is likely to look
rather similar. Again, we only need to find one non-dead index tuple!
If we can't find a suitable non-dead tuple on the first index page we
look at, then why should we find one on the second or the third?

Proposed solution
=================

We really should use an approach that makes a *hard guarantee* that
the costs that we pay will not exceed some known threshold (whether
those costs are from scanning index pages or scanning heap pages). If
we completely ignore costs paid by the index AM layer, then the
problem just shifts to that layer before too long.

Rather than counting heap page visits, I propose limiting the scan to
the extremal leaf page only. If the extremal index leaf page yields no
visible tuples, we give up immediately. The latest version of the
index prefetching patch [2] adds a WIP patch that does just that.

Testing of the patch posted to that other thread today shows that this
is much more effective. For example, with a 5 million row table, the
latency profile of running EXPLAIN about 66,000 times works out as
follows (with no dead tuples):

Baseline (no dead tuples):
Metric      Master        Patch         Ratio
----------  ------------  ------------  ----------
Avg         0.081         0.081         0.997x
p95         0.096         0.095         0.993x
p99         0.108         0.111         1.028x

Unsurprisingly, the profile for the patch looks very similar to that
of master. But when my test script deletes 20% of the tuples in the
table and then repeats the same process, things look very different:

Main (with bulk-deleted tuples, concurrent inserts/deletes):
Metric      Master        Patch         Ratio
----------  ------------  ------------  ----------
Avg         0.933         0.080         0.085x
p95         1.167         0.105         0.090x
p99         1.195         0.133         0.112x

The patch is a little over 10x faster than master now. More
importantly, the patch performs very much like it did the first time
around, before we deleted any index tuples. We have a much more
meaningful guarantee about how bad things can get in the worst case.

I don't think that we need to consider heap visits; just limiting the
scan to a single leaf page seems to be sufficient. I'm open to the
idea of giving some consideration to the number of heap fetches
required, if testing shows that we still need some of that (or that we
need to be cleverer). But I'm sure that we absolutely need to pay
attention to the cost of reading index leaf pages to do well here.
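
Under the proposed policy the same kind of toy model (again
hypothetical code, not the actual WIP patch) shows the work being
bounded by construction:

```python
def single_page_scan(first_leaf_page_tuples):
    """Read only the extremal leaf page; first_leaf_page_tuples is a
    list of booleans, True meaning the tuple is visible/non-dead."""
    index_pages_read = 1  # never more than one, whatever the table holds
    if any(first_leaf_page_tuples):
        return ("found", index_pages_read)
    return ("fall back to pg_statistic", index_pages_read)

print(single_page_scan([False] * 200))           # all dead: give up at once
print(single_page_scan([False] * 199 + [True]))  # one live tuple suffices
```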

Note on index prefetching patch
===============================

As I touched on briefly already, and went into in detail on the index
prefetching patch thread, the current VISITED_PAGES_LIMIT design is
hard to make work with the API revisions proposed by the index prefetching
patch. A recap on the problems it creates for the index prefetching
patch seems in order:

The index prefetching patch adds a new table_index_getnext_slot()
interface, rather like the current index_getnext_slot -- we need to
use that from selfuncs.c now (not the low-level index_getnext_tid +
index_fetch_heap interfaces). All visibility map lookups happen on the
table AM/heapam side with this new interface, which is a more modular
design. But it leaves selfuncs.c without any way to implement ad-hoc
counting of heap page accesses to enforce VISITED_PAGES_LIMIT; all of
the relevant heapam implementation details are hidden from callers
such as selfuncs.c.

I could work around this problem by inventing a new way to enforce
VISITED_PAGES_LIMIT that lives on the table AM side. But that doesn't
seem very appealing. We have to accommodate selfuncs.c's requirements
in lower-level code either way, it seems, so we might as well add that
mechanism to nbtree directly, since doing so also seems to fix the
existing problems with get_actual_variable_range.

Thoughts?

[1]: https://smalldatum.blogspot.com/2026/01/cpu-bound-insert-benchmark-vs-postgres.html
[2]: /messages/by-id/CAH2-Wznv9_KGqHQ1vCW2pkiA6QskBGcx5NC_-UXnD6GEQasvAQ@mail.gmail.com
--
Peter Geoghegan

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#1)
Re: Problems with get_actual_variable_range's VISITED_PAGES_LIMIT

Peter Geoghegan <pg@bowt.ie> writes:

> VISITED_PAGES_LIMIT counts heap page visits. But when there are many
> index tuples marked LP_DEAD, btgettuple will traverse
> indefinitely-many *index* pages before returning even a single tuple
> to the selfuncs.c caller.

Yup, that's a problem.

> Proposed solution
> =================
>
> Rather than counting heap page visits, I propose limiting the scan to
> the extremal leaf page only. If the extremal index leaf page yields no
> visible tuples, we give up immediately. The latest version of the
> index prefetching patch [2] adds a WIP patch that does just that.

I think that's throwing the baby out with the bathwater. In exchange
for a tight limit on planner time expended, you have an enormously
increased chance of getting no useful data at all. For one thing,
it is likely that the extremal page is only partly filled, so that
there may be only a very small number of index entries that this
strategy would consider at all.

Maybe it would work to express the limit as number of dead index
entries we will skip past before failing? I'd want a threshold of
maybe a thousand, but that would still work to bound the index AM's
work as well as the heap AM's.

> Testing of the patch posted to that other thread today shows that this
> is much more effective.

What definition of "effective" are you using? It sounds like you are
putting zero weight on successfully obtaining an estimate, and I find
that approach quite unacceptable.

regards, tom lane

#3 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#2)
Re: Problems with get_actual_variable_range's VISITED_PAGES_LIMIT

On Mon, Feb 9, 2026 at 7:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > Proposed solution
> > =================
> >
> > Rather than counting heap page visits, I propose limiting the scan to
> > the extremal leaf page only. If the extremal index leaf page yields no
> > visible tuples, we give up immediately. The latest version of the
> > index prefetching patch [2] adds a WIP patch that does just that.
>
> I think that's throwing the baby out with the bathwater. In exchange
> for a tight limit on planner time expended, you have an enormously
> increased chance of getting no useful data at all.

It's true that I have only begun to examine how much of a risk this
is. It is still a WIP patch.

> Maybe it would work to express the limit as number of dead index
> entries we will skip past before failing?

Something like that makes sense.

Perhaps the limit should be expressed in terms of both index tuples
and index pages. We allow the scan to visit additional leaf pages, if
and only if it is necessary for the scan to do so in order for it to
examine the required number of index tuples. But we also apply a hard
cap on the total number of index pages read, enforced regardless of
how many index tuples we have read. We should have some concern about
sparse deletion patterns and the like, where each leaf page has only 1
or 2 index tuples (there might be no LP_DEAD-marked index tuples that
count towards this index tuple limit within such an index).
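
That combined policy might look like this toy sketch (hypothetical
names and thresholds; not actual nbtree code):

```python
INDEX_TUPLE_LIMIT = 1000  # the order of magnitude Tom suggests
LEAF_PAGE_HARD_CAP = 10   # hypothetical hard cap on leaf pages read

def scan_with_dual_limit(pages):
    """pages: per-leaf-page index tuple counts for successive dead-only
    leaf pages. Returns which limit fired, pages read, tuples seen."""
    tuples_seen = 0
    pages_read = 0
    for n_tuples in pages:
        if pages_read >= LEAF_PAGE_HARD_CAP:
            return ("gave up: page cap", pages_read, tuples_seen)
        pages_read += 1
        tuples_seen += n_tuples
        if tuples_seen >= INDEX_TUPLE_LIMIT:
            return ("gave up: tuple limit", pages_read, tuples_seen)
    return ("scan exhausted", pages_read, tuples_seen)

# Sparse deletion pattern (2 tuples per page): the page cap fires first.
print(scan_with_dual_limit([2] * 100))    # -> ('gave up: page cap', 10, 20)
# Densely packed dead pages (400 tuples each): the tuple limit fires first.
print(scan_with_dual_limit([400] * 100))  # -> ('gave up: tuple limit', 3, 1200)
```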

Obviously, the number of index tuples isn't necessarily the same thing
as the number of TIDs contained within those index tuples. With nbtree
deduplication, a single leaf page can contain as many as ~1350
distinct TIDs. I wonder if that difference needs to be taken into
account...
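
A back-of-the-envelope check of that figure (crude arithmetic that
ignores line pointers, posting-list tuple headers, and key data, which
is why it lands a little above the ~1350 quoted):

```python
BLCKSZ = 8192       # default PostgreSQL block size
PAGE_HEADER = 24    # size of PageHeaderData
BTREE_SPECIAL = 16  # nbtree special space at the end of each page
TID_SIZE = 6        # one ItemPointerData (heap TID)

usable = BLCKSZ - PAGE_HEADER - BTREE_SPECIAL
print(usable // TID_SIZE)  # -> 1358
```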

> I'd want a threshold of
> maybe a thousand, but that would still work to bound the index AM's
> work as well as the heap AM's.

Ideally we could keep the mechanism confined to nbtree (or other index
AMs that implement the functionality that selfuncs.c will rely on).
I'd like to avoid inventing a new heapam-specific mechanism that has
to somehow coordinate with the nbtree mechanism that I have in mind
for this.

--
Peter Geoghegan

#4 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Peter Geoghegan (#1)
Re: Problems with get_actual_variable_range's VISITED_PAGES_LIMIT

On Tue, Feb 10, 2026 at 12:15 AM Peter Geoghegan <pg@bowt.ie> wrote:

> get_actual_variable_range() scans an index to find actual min/max
> values for selectivity estimation. Since this happens during planning,
> we can't afford to spend too much time on it. We use
> VISITED_PAGES_LIMIT (100 heap pages) as a threshold -- once we've
> visited that many heap pages, we give up and fall back to pg_statistic
> values. See commit 9c6ad5eaa9 for background information.
>
> [..]
>
> Thoughts?

FWIW, dunno if this helps, but from what I remember 9c6ad5eaa957 itself
was born in literally five minutes, as a very quick bandage attempt to
rescue some big OLTP database (in a backpatchable way), so there was no
serious research involved back then. AFAIR the alternative idea was
simply to cage it with a max time limit (using a time delta), but that
was basically the same idea. The 100 pages there translate directly to
WE_HAVE_NO_IDEA, so I think you could simply fall back after N*100 index
(instead of heap) pages and it would still be good enough -- why? Well,
because the timings below are literally from the case that caused this
back then (not kidding):

Planning Time: 1620687.653 ms
Execution Time: 970.716 ms

I just worry: if get_actual_variable_range() starts reporting much worse
estimates, what do we do then? Well, maybe we'll have pg_plan_advice by
then, or maybe the depth (time?) of the search should somehow be bound
to the column's statistics_target too, but on second thought it looks
like it should be more a property of the query itself (so maybe some
GUC?)

-J.