Problems with get_actual_variable_range's VISITED_PAGES_LIMIT

Started by Peter Geoghegan, 25 days ago. 4 messages.

get_actual_variable_range() scans an index to find actual min/max
values for selectivity estimation. Since this happens during planning,
we can't afford to spend too much time on it. We use
VISITED_PAGES_LIMIT (100 heap pages) as a threshold -- once we've
visited that many heap pages, we give up and fall back to pg_statistic
values. See commit 9c6ad5eaa9 for background information.

Recent benchmark results from Mark Callaghan [1] show that
get_actual_variable_range isn't very effective once the problem with
dead index tuples really starts to get out of hand. This is expected
with queue-like tables that continually have older records deleted as
newer ones are inserted. Right now this is the Achilles' heel for
Postgres when it runs on Mark's insert benchmark. It's also likely to
be a problem for TPC-C's order table. Tables that look like that are
reasonably common in real world workloads IME.

(My work on this problem is also related to the index prefetching
patch, which makes API revisions that aren't really compatible with
the current VISITED_PAGES_LIMIT design. More on that later.)

Problem
=======

VISITED_PAGES_LIMIT counts heap page visits. But when there are many
index tuples marked LP_DEAD, btgettuple will traverse
indefinitely-many *index* pages before returning even a single tuple
to the selfuncs.c caller. The selfuncs.c heap page counter never gets
a chance to increment. Profiling by Mark shows that we waste quite a
lot of CPU cycles in nbtree functions like _bt_readpage when this
happens. In practice there is no *natural limit* on the amount of work
that we can do during any one call to get_actual_variable_range --
VISITED_PAGES_LIMIT doesn't consider costs related to reading index
leaf pages at all.
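
To make the failure mode concrete, here is a toy model (emphatically
not PostgreSQL code -- the function and page counts are invented for
illustration) of a scan whose leading leaf pages contain only LP_DEAD
tuples:

```python
VISITED_PAGES_LIMIT = 100  # heap pages, as in the real selfuncs.c code

def scan_for_min(dead_leaf_pages, limit=VISITED_PAGES_LIMIT):
    """Toy scan: the first dead_leaf_pages leaf pages hold only LP_DEAD
    tuples, which the index AM skips without returning anything to the
    caller; the page after that has a live tuple."""
    heap_pages_visited = 0
    index_pages_read = 0
    for page in range(dead_leaf_pages + 1):
        if heap_pages_visited >= limit:  # the only bail-out we have today
            return ("gave up", index_pages_read, heap_pages_visited)
        index_pages_read += 1
        if page < dead_leaf_pages:
            continue  # every tuple LP_DEAD: the heap counter never moves
        heap_pages_visited += 1  # first live tuple: a single heap fetch
        return ("found", index_pages_read, heap_pages_visited)

# 10,000 all-dead leaf pages: we read 10,001 index pages, but only ever
# visit 1 heap page, so VISITED_PAGES_LIMIT never has a chance to fire.
print(scan_for_min(10_000))  # -> ('found', 10001, 1)
```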

The design of get_actual_variable_range specifically aims to set
LP_DEAD bits, with the goal of helping future calls to
get_actual_variable_range avoid heap fetches. But there's an important
sense in which this hurts rather than helps: each LP_DEAD tuple is one
less that can be counted against VISITED_PAGES_LIMIT going forward.
The more LP_DEAD bits we set, the less effective VISITED_PAGES_LIMIT
becomes at bailing out early during future calls to
get_actual_variable_range. Inevitably, the cost of reading index leaf
pages eventually becomes the bottleneck -- at least past some tipping
point. That effect seems perverse -- shouldn't setting LP_DEAD bits in
index tuples be strictly a good thing?

I suspect that VISITED_PAGES_LIMIT is fundamentally addressing the
problem in the wrong way. Even without VISITED_PAGES_LIMIT, or
anything like it, it is reasonable to expect that
get_actual_variable_range will require a very small, fixed amount of
work to find an extremal value in the vast majority of cases. After
all, we only need to locate one non-dead extremal index tuple, and
then we're done. If we get anywhere near VISITED_PAGES_LIMIT, then
it's already reasonable to suppose that we're dealing with a
pathological workload -- a workload exactly like the one that Mark's
testing ran into.

ISTM that Mark has demonstrated that VISITED_PAGES_LIMIT doesn't work
particularly well in general. AFAICT there is nothing special about
Mark's complaint -- it's not some special case that the
VISITED_PAGES_LIMIT design didn't directly consider. Any case where
get_actual_variable_range does many heap fetches is likely to look
rather similar. Again, we only need to find one non-dead index tuple!
If we can't find a suitable non-dead tuple on the first index page we
look at, then why should we find one on the second or the third?

Proposed solution
=================

We really should use an approach that makes a *hard guarantee* that
the costs that we pay will not exceed some known threshold (whether
those costs are from scanning index pages or scanning heap pages). If
we completely ignore costs paid by the index AM layer, then the
problem just shifts to that layer before too long.

Rather than counting heap page visits, I propose limiting the scan to
the extremal leaf page only. If the extremal index leaf page yields no
visible tuples, we give up immediately. The latest version of the
index prefetching patch [2] adds a WIP patch that does just that.

Testing of the patch posted to that other thread today shows that this
is much more effective. For example, with a 5 million row table, the
latency profile of running EXPLAIN about 66,000 times works out as
follows (with no dead tuples):

Baseline (no dead tuples):
Metric      Master        Patch         Ratio
----------  ------------  ------------  ----------
Avg         0.081         0.081         0.997x
p95         0.096         0.095         0.993x
p99         0.108         0.111         1.028x

Unsurprisingly, the profile for the patch looks very similar to that
of master. But when my test script deletes 20% of the tuples in the
table and then repeats the same process, things look very different:

Main (with bulk-deleted tuples, concurrent inserts/deletes):
Metric      Master        Patch         Ratio
----------  ------------  ------------  ----------
Avg         0.933         0.080         0.085x
p95         1.167         0.105         0.090x
p99         1.195         0.133         0.112x

The patch is a little over 10x faster than master now. More
importantly, the patch performs very much like it did the first time
around, before we deleted any index tuples. We have a much more
meaningful guarantee about how bad things can get in the worst case.

I don't think that we need to consider heap visits; just limiting the
scan to a single leaf page seems to be sufficient. I'm open to the
idea of giving some consideration to the number of heap fetches
required, if testing shows that we still need some of that (or that we
need to be cleverer). But I'm sure that we absolutely need to pay
attention to the cost of reading index leaf pages to do well here.
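
Under the proposed policy the same kind of toy model (again
hypothetical code, not the actual WIP patch) shows the work being
bounded by construction:

```python
def single_page_scan(first_leaf_page_tuples):
    """Read only the extremal leaf page; first_leaf_page_tuples is a
    list of booleans, True meaning the tuple is visible/non-dead."""
    index_pages_read = 1  # never more than one, whatever the table holds
    if any(first_leaf_page_tuples):
        return ("found", index_pages_read)
    return ("fall back to pg_statistic", index_pages_read)

print(single_page_scan([False] * 200))           # all dead: give up at once
print(single_page_scan([False] * 199 + [True]))  # one live tuple suffices
```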

Note on index prefetching patch
===============================

As I touched on briefly already, and went into in detail on the index
prefetching patch thread, the current VISITED_PAGES_LIMIT design is
hard to make work with the API revisions proposed by the index prefetching
patch. A recap on the problems it creates for the index prefetching
patch seems in order:

The index prefetching patch adds a new table_index_getnext_slot()
interface, rather like the current index_getnext_slot -- we need to
use that from selfuncs.c now (not the low-level index_getnext_tid +
index_fetch_heap interfaces). All visibility map lookups happen on the
table AM/heapam side with this new interface, which is a more modular
design. But it leaves selfuncs.c without any way to implement ad-hoc
counting of heap page accesses to enforce VISITED_PAGES_LIMIT; all of
the relevant heapam implementation details are hidden from callers
such as selfuncs.c.

I could work around this problem by inventing a new way to enforce
VISITED_PAGES_LIMIT that lives on the table AM side. But that doesn't
seem very appealing. We have to accommodate selfuncs.c's requirements
in lower-level code either way, it seems, so we might as well add that
mechanism to nbtree directly, since doing so also seems to fix the
existing problems with get_actual_variable_range.

Thoughts?

[1]: https://smalldatum.blogspot.com/2026/01/cpu-bound-insert-benchmark-vs-postgres.html
[2]: /messages/by-id/CAH2-Wznv9_KGqHQ1vCW2pkiA6QskBGcx5NC_-UXnD6GEQasvAQ@mail.gmail.com
--
Peter Geoghegan

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#1)
Re: Problems with get_actual_variable_range's VISITED_PAGES_LIMIT

Peter Geoghegan <pg@bowt.ie> writes:

> VISITED_PAGES_LIMIT counts heap page visits. But when there are many
> index tuples marked LP_DEAD, btgettuple will traverse
> indefinitely-many *index* pages before returning even a single tuple
> to the selfuncs.c caller.

Yup, that's a problem.

> Proposed solution
> =================
>
> Rather than counting heap page visits, I propose limiting the scan to
> the extremal leaf page only. If the extremal index leaf page yields no
> visible tuples, we give up immediately. The latest version of the
> index prefetching patch [2] adds a WIP patch that does just that.

I think that's throwing the baby out with the bathwater. In exchange
for a tight limit on planner time expended, you have an enormously
increased chance of getting no useful data at all. For one thing,
it is likely that the extremal page is only partly filled, so that
there may be only a very small number of index entries that this
strategy would consider at all.

Maybe it would work to express the limit as number of dead index
entries we will skip past before failing? I'd want a threshold of
maybe a thousand, but that would still work to bound the index AM's
work as well as the heap AM's.

> Testing of the patch posted to that other thread today shows that this
> is much more effective.

What definition of "effective" are you using? It sounds like you are
putting zero weight on successfully obtaining an estimate, and I find
that approach quite unacceptable.

regards, tom lane

#3 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#2)
Re: Problems with get_actual_variable_range's VISITED_PAGES_LIMIT

On Mon, Feb 9, 2026 at 7:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > Proposed solution
> > =================
> >
> > Rather than counting heap page visits, I propose limiting the scan to
> > the extremal leaf page only. If the extremal index leaf page yields no
> > visible tuples, we give up immediately. The latest version of the
> > index prefetching patch [2] adds a WIP patch that does just that.
>
> I think that's throwing the baby out with the bathwater. In exchange
> for a tight limit on planner time expended, you have an enormously
> increased chance of getting no useful data at all.

It's true that I have only begun to examine how much of a risk this
is. It is still a WIP patch.

> Maybe it would work to express the limit as number of dead index
> entries we will skip past before failing?

Something like that makes sense.

Perhaps the limit should be expressed in terms of both index tuples
and index pages. We allow the scan to visit additional leaf pages, if
and only if it is necessary for the scan to do so in order for it to
examine the required number of index tuples. But we also apply a hard
cap on the total number of index pages read, enforced regardless of
how many index tuples we have read. We should have some concern about
sparse deletion patterns and the like, where each leaf page has only 1
or 2 index tuples (there might be no LP_DEAD-marked index tuples that
count towards this index tuple limit within such an index).
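
That combined policy might look like this toy sketch (hypothetical
names and thresholds; not actual nbtree code):

```python
INDEX_TUPLE_LIMIT = 1000  # the order of magnitude Tom suggests
LEAF_PAGE_HARD_CAP = 10   # hypothetical hard cap on leaf pages read

def scan_with_dual_limit(pages):
    """pages: per-leaf-page index tuple counts for successive dead-only
    leaf pages. Returns which limit fired, pages read, tuples seen."""
    tuples_seen = 0
    pages_read = 0
    for n_tuples in pages:
        if pages_read >= LEAF_PAGE_HARD_CAP:
            return ("gave up: page cap", pages_read, tuples_seen)
        pages_read += 1
        tuples_seen += n_tuples
        if tuples_seen >= INDEX_TUPLE_LIMIT:
            return ("gave up: tuple limit", pages_read, tuples_seen)
    return ("scan exhausted", pages_read, tuples_seen)

# Sparse deletion pattern (2 tuples per page): the page cap fires first.
print(scan_with_dual_limit([2] * 100))    # -> ('gave up: page cap', 10, 20)
# Densely packed dead pages (400 tuples each): the tuple limit fires first.
print(scan_with_dual_limit([400] * 100))  # -> ('gave up: tuple limit', 3, 1200)
```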

Obviously, the number of index tuples isn't necessarily the same thing
as the number of TIDs contained within those index tuples. With nbtree
deduplication, a single leaf page can contain as many as ~1350
distinct TIDs. I wonder if that difference needs to be taken into
account...
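
A back-of-the-envelope check of that figure (crude arithmetic that
ignores line pointers, posting-list tuple headers, and key data, which
is why it lands a little above the ~1350 quoted):

```python
BLCKSZ = 8192       # default PostgreSQL block size
PAGE_HEADER = 24    # size of PageHeaderData
BTREE_SPECIAL = 16  # nbtree special space at the end of each page
TID_SIZE = 6        # one ItemPointerData (heap TID)

usable = BLCKSZ - PAGE_HEADER - BTREE_SPECIAL
print(usable // TID_SIZE)  # -> 1358
```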

> I'd want a threshold of
> maybe a thousand, but that would still work to bound the index AM's
> work as well as the heap AM's.

Ideally we could keep the mechanism confined to nbtree (or other index
AMs that implement the functionality that selfuncs.c will rely on).
I'd like to avoid inventing a new heapam-specific mechanism that has
to somehow coordinate with the nbtree mechanism that I have in mind
for this.

--
Peter Geoghegan

#4 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Peter Geoghegan (#1)
Re: Problems with get_actual_variable_range's VISITED_PAGES_LIMIT

On Tue, Feb 10, 2026 at 12:15 AM Peter Geoghegan <pg@bowt.ie> wrote:

> get_actual_variable_range() scans an index to find actual min/max
> values for selectivity estimation. Since this happens during planning,
> we can't afford to spend too much time on it. We use
> VISITED_PAGES_LIMIT (100 heap pages) as a threshold -- once we've
> visited that many heap pages, we give up and fall back to pg_statistic
> values. See commit 9c6ad5eaa9 for background information.
>
> [..]
>
> Thoughts?

FWIW, dunno if this helps, but from what I remember 9c6ad5eaa957 itself
was born in literally five minutes, as a very quick bandage attempt to
rescue some big OLTP database (in a backpatchable way), so there was no
serious research involved back then. AFAIR the alternative idea was
simply to cage it with a max time limit (using a time delta), but that
was basically the same idea. The 100 pages there translate directly to
WE_HAVE_NO_IDEA, so I think you could simply fall back after N*100 index
(instead of heap) pages and it would still be good enough -- why? Well,
because the timings below are literally from the case that caused this
back then (not kidding):

Planning Time: 1620687.653 ms
Execution Time: 970.716 ms

I just worry: if get_actual_variable_range() starts reporting much worse
estimates, what do we do then? Well, maybe we'll have pg_plan_advice by
then, or maybe the depth (time?) of the search should somehow be bound
to the column's statistics_target too, but on second thought it looks
like it should be more a property of the query itself (so maybe some
GUC?)

-J.