Damage control for planner's get_actual_variable_endpoint() runaway
Hi hackers,
the case of the planner's
src/backend/utils/adt/selfuncs.c:get_actual_variable_endpoint()
spending literally seconds seems to be a well-known fact among hackers
(in the extreme wild case it can take over an hour on VLDBs). For those
unfamiliar: it is a planner estimation step that reads the real table
index (including dead rows) to find the actual min/max. It is a
black-box mechanism that works without any warning, is often hugely
affected by the number of dead tuples in indexes, and has no on/off
switch or built-in limit on how far it can go. It was discussed on the
pgsql mailing lists several times [1]-[5]. It almost seems like it works
fine in 99.9% of cases, until it doesn't and blows up big time on larger
systems, and from there the operator doesn't have a lot of choices (a
lot of time having already been wasted on identifying the planner as
the root cause):
1) one can properly VACUUM (which everybody seems to agree is the
proper way to go, but it is often problematic due to various other
circumstances, especially on big tables without a serious partitioning
strategy) - again, this might be very time-consuming
2) one cannot trade the heavy CPU/IO burned during planning (actually
fetching index pages on multi-TB tables) for less accurate plans, and
realistically speaking, rewriting queries is often impossible
3) the application might not support prepared statements, and even if
it does, simple queries/reports are also affected
4) there is no visibility into how much activity is spent in btree
index get_actual_variable_endpoint() alone, so one cannot estimate the
system-wide impact
I would like to trigger a discussion on how to give the end user at
least partial control over what the optimizer performs. I was thinking
about several things, each of which has pros and cons:
a) the attached quick patch (by Simon Riggs) that puts a maximum
allowed cost constraint on the index-lookup machinery as a safeguard
(the #define is debatable; in my testing it reduced the hit from ~850ms
to 0.6ms +/- 0.3ms at the current value of 20).
b) I was wondering about creating a new wait class "Planner" with the
event "ReadingMinMaxIndex" (or similar). The obvious drawback is the
naming/categorization as wait events are ... well.. as the name "WAIT"
implies, while those btree lookups could easily be ON-CPU activity.
c) Any other idea, e.g. see [3] (SnapshotAny vs SnapshotDirty
discussions between Tom and Robert) or [5] (where a cache was being
proposed).
d) For completeness: a new GUC/knob to completely disable the
functionality (debug_optimizer_minmax_est), but that is really just a
trimmed-down version of the patch.
e) I was also wondering about emitting a DEBUG/NOTICE elog() when this
takes more than, say, an arbitrary 10s, but that could easily spam the
log file.
Reproducer on a very small dataset follows. Please note the reproducer
shows the effect on the first EXPLAIN run; however, in the real
observed situation (multi-TB unpartitioned table) each consecutive
planner operation (just EXPLAIN) on that index was affected (I don't
know why LP_DEAD/hint-bit cleanup was not kicking in - maybe it was,
but given the scale of the problem it was not helping much).
-Jakub Wartak.
[1]: /messages/by-id/54446AE2.6080909@BlueTreble.com
[2]: /messages/by-id/db7111f2-05ef-0ceb-c013-c34adf4f4121@gmail.com
[3]: /messages/by-id/05C72CF7-B5F6-4DB9-8A09-5AC897653113@yandex.ru (SnapshotAny vs SnapshotDirty discussions between Tom and Robert)
[4]: /messages/by-id/CAECtzeVPM4Oi6dTdqVQmjoLkDBVChNj7ed3hNs1RGrBbwCJ7Cw@mail.gmail.com
[5]: https://postgrespro.com/list/thread-id/2436130 (cache)
s1:
=# drop table test;
=# create table test (id bigint primary key) with (autovacuum_enabled = 'off');
=# insert into test select generate_series(1,10000000); -- ~310MB table, ~210MB index
s2/start the long running transaction:
=# begin;
=# select min(id) from test;
s1:
=# delete from test where id>1000000;
=# analyze test;
=# set enable_indexonlyscan = 'off'; -- just in case, to force BitmapHeapScans, which according to backend/access/nbtree/README won't set LP_DEAD, but my bt_page_items() tests indicate that it does (??)
=# set enable_indexscan = 'off';
=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=9155 read=55276 dirtied=55271 written=53617 / Time: 851.761 ms
=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared read=24594 / Time: 49.172 ms
=# vacuum (verbose) test; => index scan needed: 39824 pages from table (90.00% of total) had 9000000 dead item identifiers removed
=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=14 read=3 / Time: 0.550 ms
With the patch, the results are:
p=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=17 read=6 dirtied=3 written=5 / Time: 0.253 ms
p=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=11 read=2 dirtied=2 / Time: 0.576 ms
so there's no dramatic hit.
Attachment: apply_cost_limitation_get_actual_variable_endpoint.patch (+17, -2)
On 2022-Nov-21, Jakub Wartak wrote:
b) I was wondering about creating a new wait class "Planner" with the
event "ReadingMinMaxIndex" (or similar). The obvious drawback is the
naming/categorization as wait events are ... well.. as the name "WAIT"
implies, while those btree lookups could easily be ON-CPU activity.
I think we should definitely do this, regardless of any other fixes we
add, so that this condition can be identified more easily. I wonder if
we can backpatch it safely.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine." (Julien PUYDT)
On Mon, Nov 21, 2022 at 7:22 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2022-Nov-21, Jakub Wartak wrote:
b) I was wondering about creating a new wait class "Planner" with the
event "ReadingMinMaxIndex" (or similar). The obvious drawback is the
naming/categorization as wait events are ... well.. as the name "WAIT"
implies, while those btree lookups could easily be ON-CPU activity.

I think we should definitely do this, regardless of any other fixes we
add, so that this condition can be identified more easily. I wonder if
we can backpatch it safely.
I don't think this is safe at all. Wait events can only bracket
individual operations, like the reads of the individual index blocks,
not report on which phase of a larger operation is in progress. If we
try to make them do the latter, we will have a hot mess on our hands.
It might not be a bad capability to have, but it's a different system.
But that doesn't mean we can't do anything about this problem, and I
think we should do something about this problem. It's completely crazy
that after this many years of people getting hosed by this, we haven't
taken more than half measures to fix the problem. I think commit
3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd was the last time we poked at
this, and before that there was commit
fccebe421d0c410e6378fb281419442c84759213, but neither of those
prevented us from scanning an unbounded number of index tuples before
finding one that we're willing to use, as the commit messages pretty
much admit.
What we need is a solution that avoids reading an unbounded number of
tuples under any circumstances. I previously suggested using
SnapshotAny here, but Tom didn't like that. I'm not sure if there are
safety issues there or if Tom was just concerned about the results
being misleading. Either way, maybe there's some variant on that theme
that could work. For instance, could we teach the index scan to stop
if the first 100 tuples that it finds are all invisible? Or to reach
at most 1 page, or at most 10 pages, or something? If we don't find a
match, we could either try to use a dead tuple, or we could just
return false which, I think, would end up using the value from
pg_statistic rather than any updated value. That is of course not a
great outcome, but it is WAY WAY BETTER than spending OVER AN HOUR
looking for a more suitable tuple, as Jakub describes having seen on a
production system.
I really can't understand why this is even mildly controversial. What
exactly to do here may be debatable, but the idea that it's OK to
spend an unbounded amount of resources during any phase of planning is
clearly wrong. We can say that at the time we wrote the code we didn't
know it was going to actually ever happen, and that is fine and true.
But there have been multiple reports of this over the years and we
know *for sure* that spending totally unreasonable amounts of time
inside this function is a real-world problem that actually brings down
production systems. Unless and until we find a way of putting a tight
bound on the amount of effort that can be expended here, that's going
to keep happening.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
I don't think this is safe at all. Wait events can only bracket
individual operations, like the reads of the individual index blocks,
not report on which phase of a larger operation is in progress. If we
try to make them do the latter, we will have a hot mess on our hands.
Agreed.
What we need is a solution that avoids reading an unbounded number of
tuples under any circumstances. I previously suggested using
SnapshotAny here, but Tom didn't like that. I'm not sure if there are
safety issues there or if Tom was just concerned about the results
being misleading. Either way, maybe there's some variant on that theme
that could work. For instance, could we teach the index scan to stop
if the first 100 tuples that it finds are all invisible? Or to reach
at most 1 page, or at most 10 pages, or something?
A hard limit on the number of index pages examined seems like it
might be a good idea.
If we don't find a
match, we could either try to use a dead tuple, or we could just
return false which, I think, would end up using the value from
pg_statistic rather than any updated value.
Yeah, the latter seems like the best bet. Returning a definitely-dead
value could be highly misleading. In the end this is meant to be
an incremental improvement on what we could get from pg_statistic,
so it's reasonable to limit how hard we'll work on it.
If we do install such a thing, should we undo any of the previous
changes that backed off the reliability of the result?
regards, tom lane
On Mon, 21 Nov 2022 at 15:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
What we need is a solution that avoids reading an unbounded number of
tuples under any circumstances. I previously suggested using
SnapshotAny here, but Tom didn't like that. I'm not sure if there are
safety issues there or if Tom was just concerned about the results
being misleading. Either way, maybe there's some variant on that theme
that could work. For instance, could we teach the index scan to stop
if the first 100 tuples that it finds are all invisible? Or to reach
at most 1 page, or at most 10 pages, or something?

A hard limit on the number of index pages examined seems like it
might be a good idea.
Good, that is what the patch does.
If we don't find a
match, we could either try to use a dead tuple, or we could just
return false which, I think, would end up using the value from
pg_statistic rather than any updated value.

Yeah, the latter seems like the best bet.
Yes, just breaking out of the loop is enough to use the default value.
If we do install such a thing, should we undo any of the previous
changes that backed off the reliability of the result?
Not sure.
--
Simon Riggs http://www.EnterpriseDB.com/
On Mon, Nov 21, 2022 at 10:14 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
What we need is a solution that avoids reading an unbounded number of
tuples under any circumstances. I previously suggested using
SnapshotAny here, but Tom didn't like that. I'm not sure if there are
safety issues there or if Tom was just concerned about the results
being misleading. Either way, maybe there's some variant on that theme
that could work. For instance, could we teach the index scan to stop
if the first 100 tuples that it finds are all invisible? Or to reach
at most 1 page, or at most 10 pages, or something?

A hard limit on the number of index pages examined seems like it
might be a good idea.

Good, that is what the patch does.
<looks at patch>
Oh, that's surprisingly simple. Nice!
Is there any reason to tie this into page costs? I'd be more inclined
to just make it a hard limit on the number of pages. I think that
would be more predictable and less prone to surprising (bad) behavior.
And to be honest I would be inclined to make it quite a small number.
Perhaps 5 or 10. Is there a good argument for going any higher?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, 21 Nov 2022 at 15:23, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 21, 2022 at 10:14 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:

What we need is a solution that avoids reading an unbounded number of
tuples under any circumstances. I previously suggested using
SnapshotAny here, but Tom didn't like that. I'm not sure if there are
safety issues there or if Tom was just concerned about the results
being misleading. Either way, maybe there's some variant on that theme
that could work. For instance, could we teach the index scan to stop
if the first 100 tuples that it finds are all invisible? Or to reach
at most 1 page, or at most 10 pages, or something?

A hard limit on the number of index pages examined seems like it
might be a good idea.

Good, that is what the patch does.
<looks at patch>
Oh, that's surprisingly simple. Nice!
Is there any reason to tie this into page costs? I'd be more inclined
to just make it a hard limit on the number of pages. I think that
would be more predictable and less prone to surprising (bad) behavior.
And to be honest I would be inclined to make it quite a small number.
Perhaps 5 or 10. Is there a good argument for going any higher?
+1, that makes the patch smaller and the behavior more predictable.
(Just didn't want to do anything too simple, in case it looked like a kluge.)
--
Simon Riggs http://www.EnterpriseDB.com/
Robert Haas <robertmhaas@gmail.com> writes:
Is there any reason to tie this into page costs? I'd be more inclined
to just make it a hard limit on the number of pages. I think that
would be more predictable and less prone to surprising (bad) behavior.
Agreed, a simple limit of N pages fetched seems appropriate.
And to be honest I would be inclined to make it quite a small number.
Perhaps 5 or 10. Is there a good argument for going any higher?
Sure: people are not complaining until it gets into the thousands.
And you have to remember that the entire mechanism exists only
because of user complaints about inaccurate estimates. We shouldn't
be too eager to resurrect that problem.
I'd be happy with a limit of 100 pages.
regards, tom lane
On Mon, Nov 21, 2022 at 10:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Is there any reason to tie this into page costs? I'd be more inclined
to just make it a hard limit on the number of pages. I think that
would be more predictable and less prone to surprising (bad) behavior.

Agreed, a simple limit of N pages fetched seems appropriate.
And to be honest I would be inclined to make it quite a small number.
Perhaps 5 or 10. Is there a good argument for going any higher?

Sure: people are not complaining until it gets into the thousands.
And you have to remember that the entire mechanism exists only
because of user complaints about inaccurate estimates. We shouldn't
be too eager to resurrect that problem.

I'd be happy with a limit of 100 pages.
OK.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
Draft version of the patch attached (it is based on Simon's).
I would be happier if we could make that #define into a GUC (just in
case), although I do understand the effort to reduce the number of
various knobs (as their high count causes its own complexity).
-Jakub Wartak.
On Mon, Nov 21, 2022 at 4:35 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 21, 2022 at 10:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Is there any reason to tie this into page costs? I'd be more inclined
to just make it a hard limit on the number of pages. I think that
would be more predictable and less prone to surprising (bad) behavior.

Agreed, a simple limit of N pages fetched seems appropriate.
And to be honest I would be inclined to make it quite a small number.
Perhaps 5 or 10. Is there a good argument for going any higher?

Sure: people are not complaining until it gets into the thousands.
And you have to remember that the entire mechanism exists only
because of user complaints about inaccurate estimates. We shouldn't
be too eager to resurrect that problem.

I'd be happy with a limit of 100 pages.
OK.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment: 0001-Damage-control-for-planner-s-get_actual_variable_end.patch (+16, -2)
Hi,
On 2022-11-21 17:06:16 +0100, Jakub Wartak wrote:
@@ -6213,14 +6216,26 @@ get_actual_variable_endpoint(Relation heapRel,
 	/* Fetch first/next tuple in specified direction */
 	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
 	{
+		BlockNumber block = ItemPointerGetBlockNumber(tid);
+
 		if (!VM_ALL_VISIBLE(heapRel,
-							ItemPointerGetBlockNumber(tid),
+							block,
 							&vmbuffer))
 		{
 			/* Rats, we have to visit the heap to check visibility */
 			if (!index_fetch_heap(index_scan, tableslot))
 				continue;	/* no visible tuple, try next index entry */
+			{
+				CHECK_FOR_INTERRUPTS();
+				if (block != last_block)
+					visited_pages++;
+#define VISITED_PAGES_LIMIT 100
+				if (visited_pages > VISITED_PAGES_LIMIT)
+					break;
+				else
+					continue;	/* no visible tuple, try next index entry */
+			}
+
 			/* We don't actually need the heap tuple for anything */
 			ExecClearTuple(tableslot);
--
2.30.2
This can't quite be right - isn't this only applying the limit if we found a
visible tuple?
Greetings,
Andres Freund
This patch version runs "continue" unconditionally (rather than
conditionally, like the previous version).
if (!index_fetch_heap(index_scan, tableslot))
continue; /* no visible tuple, try next index entry */
On Mon, Nov 21, 2022 at 12:30 PM Andres Freund <andres@anarazel.de> wrote:
This can't quite be right - isn't this only applying the limit if we found a
visible tuple?
It doesn't look that way to me, but perhaps I'm just too dense to see
the problem?
--
Robert Haas
EDB: http://www.enterprisedb.com
Andres Freund <andres@anarazel.de> writes:
This can't quite be right - isn't this only applying the limit if we found a
visible tuple?
What it's restricting is the number of heap page fetches, which
might be good enough. We don't have a lot of visibility here
into how many index pages were scanned before returning the next
not-dead index entry, so I'm not sure how hard it'd be to do better.
regards, tom lane
On Mon, Nov 21, 2022 at 12:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
This can't quite be right - isn't this only applying the limit if we found a
visible tuple?

What it's restricting is the number of heap page fetches, which
might be good enough. We don't have a lot of visibility here
into how many index pages were scanned before returning the next
not-dead index entry, so I'm not sure how hard it'd be to do better.
Oh. That's really sad. Because I think the whole problem here is that
the number of dead index entries can be huge.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Nov 21, 2022 at 12:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
What it's restricting is the number of heap page fetches, which
might be good enough. We don't have a lot of visibility here
into how many index pages were scanned before returning the next
not-dead index entry, so I'm not sure how hard it'd be to do better.
Oh. That's really sad. Because I think the whole problem here is that
the number of dead index entries can be huge.
If they're *actually* dead, we have enough mitigations in place I think,
as explained by the long comment in get_actual_variable_endpoint.
The problem here is with still-visible-to-somebody tuples. At least,
Jakub's test case sets it up that way.
regards, tom lane
Hi,
On November 21, 2022 9:37:34 AM PST, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 21, 2022 at 12:30 PM Andres Freund <andres@anarazel.de> wrote:
This can't quite be right - isn't this only applying the limit if we found a
visible tuple?

It doesn't look that way to me, but perhaps I'm just too dense to see
the problem?
The earlier version didn't have the issue, but the latest seems to only limit after a visible tuple has been found. Note the continue; when fetching a heap tuple fails.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, 21 Nov 2022 at 18:17, Andres Freund <andres@anarazel.de> wrote:
Hi,
On November 21, 2022 9:37:34 AM PST, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 21, 2022 at 12:30 PM Andres Freund <andres@anarazel.de> wrote:
This can't quite be right - isn't this only applying the limit if we found a
visible tuple?

It doesn't look that way to me, but perhaps I'm just too dense to see
the problem?

The earlier version didn't have the issue, but the latest seems to only
limit after a visible tuple has been found. Note the continue; when
fetching a heap tuple fails.
Agreed, resolved in this version.
Robert, something like this perhaps? limit on both the index and the heap.
--
Simon Riggs http://www.EnterpriseDB.com/
Attachment: 0001-Damage-control-for-planner-s-get_actual_variable_end.v2.patch (+32, -2)
Hi,
On November 21, 2022 10:44:17 AM PST, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
Robert, something like this perhaps? limit on both the index and the heap.
I don't think we should add additional code / struct members into very common good paths for these limits.
I don't really understand the point of limiting in the index - where would the large number of pages accessed come from?
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
On November 21, 2022 10:44:17 AM PST, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
Robert, something like this perhaps? limit on both the index and the heap.
I don't think we should add additional code / struct members into very common good paths for these limits.
Yeah, I don't like that either: for one thing, it seems completely
unsafe to back-patch.
regards, tom lane