planner makes provably suboptimal index selection choices with partial indexes
Hello,
I have a situation with a highly volatile table where the planner seems to
make the incorrect choice. The problem is definitely statistics related,
but I don't think statistics ought to be related to the determination of
which index to consider.
Setting up a reproduction case is a bit challenging, so the main question is
whether my position is logically correct.
Consider a table:
CREATE TABLE foo(foo_key bigint, t1 TIMESTAMPTZ, t2 TIMESTAMPTZ);
...and two partial indexes
CREATE INDEX foo_a_idx ON foo(a) WHERE t2 IS NOT NULL;
CREATE INDEX foo_t1_idx ON foo(t1) WHERE t2 IS NOT NULL;
...and the following query
SELECT * FROM foo f WHERE a = ? AND t2 IS NOT NULL;
...the database seems to want to pick foo_t1_idx when the estimated row
counts are similar, even though ISTM there is no way it could give better
results for any distribution of data; only the same, or worse (is this
true?). So my position here is that, when selecting indexes, quals seem
not to be given enough weight. This can lead to dramatic performance
degradations if the estimated row counts are off, which can happen when
t2 'not null' is volatile and designed to isolate the 'needle' from the
'haystack'.
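For what it's worth, a quick way to compare the two candidate plans is to
hide one index inside a throwaway transaction (purely illustrative; substitute
a real value for the parameter):

BEGIN;
DROP INDEX foo_t1_idx;               -- temporarily hide the index the planner prefers
EXPLAIN SELECT * FROM foo f
  WHERE a = 42 AND t2 IS NOT NULL;   -- foo_a_idx is now the only partial index available
ROLLBACK;                            -- nothing is permanently dropped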
This example is a bit simplified from my actual case, but this seems to be a
general issue; over the years I've noticed that I've had to aggressively
analyze or do other tricks to force the database to pick optimal index a over
suboptimal index b. I'm interested in whether there is awareness and/or
interest in this; if so, and if needed, I can start working through some
reproduction cases. Note: I haven't tested on 18+ yet, so there is a chance
this is already addressed, assuming it is in fact a known and solvable issue.
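Roughly, the kind of reproduction case I have in mind would look something
like this (the types and data distribution are made up for illustration; in
particular I'm assuming 'a' is bigint and t2 is NULL for all but a handful of
rows):

CREATE TABLE foo(foo_key bigint, a bigint, t1 timestamptz, t2 timestamptz);
INSERT INTO foo
  SELECT g, g % 1000, now() - (g || ' seconds')::interval, NULL
  FROM generate_series(1, 1000000) g;
UPDATE foo SET t2 = now() WHERE foo_key % 100000 = 0;  -- tiny volatile 'needle' set
CREATE INDEX foo_a_idx ON foo(a) WHERE t2 IS NOT NULL;
CREATE INDEX foo_t1_idx ON foo(t1) WHERE t2 IS NOT NULL;
ANALYZE foo;
EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM foo f WHERE a = 0 AND t2 IS NOT NULL;
-- the interesting behavior shows up once t2 churns and the stats go stale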
Thanks in advance,
merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> I have a situation with a highly volatile table where the planner seems to
> make the incorrect choice. The problem is definitely statistics related,
> but I don't think statistics ought to be related to the determination of
> which index to consider.
Why would you think that? Statistics are critical in estimating how
many index entries are likely to be hit.
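For instance, the selectivity estimates for "a = ?" and "t2 IS NOT NULL" are
derived from what ANALYZE last recorded, which you can inspect with something
like:

SELECT attname, null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'foo' AND attname IN ('a', 't1', 't2');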
> Consider a table:
> CREATE TABLE foo(foo_key bigint, t1 TIMESTAMPTZ, t2 TIMESTAMPTZ);
> ...and two partial indexes
> CREATE INDEX foo_a_idx ON foo(a) WHERE t2 IS NOT NULL;
> CREATE INDEX foo_t1_idx ON foo(t1) WHERE t2 IS NOT NULL;
> ...and the following query
> SELECT * FROM foo f WHERE a = ? AND t2 IS NOT NULL;
> ...the database seems to want to pick foo_t1_idx when the estimated row
> counts are similar, even though ISTM there is no way it could give better
> results for any distribution of data; only the same, or worse (is this
> true?).
It might be that foo_t1_idx is enough smaller than foo_a_idx that the
estimated scanning cost comes out less even though more rows are
predicted to be read. That could be because the individual index
entries are smaller (not possible for bigint vs timestamptz, but
your example is pretty unclear about what the data type of "a" is),
or because one index is bloated compared to the other thanks to a more
erratic insertion pattern.
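Comparing the physical sizes of the two indexes (and, with the pgstattuple
extension installed, their leaf density) would show whether that's what is
happening here, e.g.:

SELECT pg_size_pretty(pg_relation_size('foo_a_idx')) AS a_idx_size,
       pg_size_pretty(pg_relation_size('foo_t1_idx')) AS t1_idx_size;
-- with: CREATE EXTENSION pgstattuple;
SELECT avg_leaf_density, leaf_fragmentation FROM pgstatindex('foo_a_idx');
SELECT avg_leaf_density, leaf_fragmentation FROM pgstatindex('foo_t1_idx');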
Hard to tell exactly what's happening without a more concrete example,
but there are plenty of factors involved in these choices.
regards, tom lane