statistics horribly broken for row-wise comparison
It looks like for row-wise comparison, only the first column is used
for generating the expected row count. This can lead to bad plans in
some cases.
Test case (takes seconds to minutes, depending on hardware):
create table range as
  select v as id,
         v % 500 as key,
         now() + ((random() * 1000) || ' days')::interval as ts
  from generate_series(1, 10000000) v;
create index range_idx on range(key, ts);
explain analyze
select * from range
where (key, ts) >= (222, '7/11/2009')
  and (key, ts) <= (222, '7/12/2009')
order by key, ts;
Result (cut down a bit):
Sort (cost=469723.46..475876.12 rows=2461061 width=16) (actual
time=0.054..0.056 rows=13 loops=1)
Sort Key: key, ts
Sort Method:  quicksort  Memory: 25kB
Note the expected vs. actual row count.  Varying the ts parameters
changes the expected rows, but varying the key does not.  For this
test case the returned plan is ok, but in other circumstances the
planner will freak out and drop to a seq scan or do other nefarious
things.
I confirmed this on 8.2 and HEAD (a month old or so).
merlin
On Mon, Mar 2, 2009 at 4:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Note the expected vs. actual row count.  Varying the ts parameters
> changes the expected rows, but varying the key does not.
Oops, that's backwards: the expected rows depend on key (the first
column in the index) only.
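One quick way to see this against the test table above (a sketch; the
exact estimates depend on your statistics):

-- widening the ts window with key fixed should leave the row estimate
-- alone, while widening the key range should move it (illustrative):
explain select * from range
where (key, ts) >= (222, '7/11/2009') and (key, ts) <= (222, '9/12/2009');

explain select * from range
where (key, ts) >= (100, '7/11/2009') and (key, ts) <= (300, '7/12/2009');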
merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> It looks like for row-wise comparison, only the first column is used
> for generating the expected row count.
[ shrug... ] Short of multi-column statistics, it's hard to see how to
do better.
regards, tom lane
On Mon, Mar 2, 2009 at 8:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> It looks like for row-wise comparison, only the first column is used
>> for generating the expected row count.
>
> [ shrug... ]  Short of multi-column statistics, it's hard to see how to
> do better.
hm... Why can't you just multiply the range estimates for the fields
together when doing an operation over the key?

For example, in this case if the planner estimates 10% of rows for
key and 5% of rows for ts, just multiply .1 by .05 and get .005 when
you fall into the row operation case.  This would give a reasonably
accurate answer... formally correct, even.  All the information is
there, or am I missing something (not knowing all the inner workings
of the planner, I certainly might be)?
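As a back-of-envelope check of that independence math against the test
table upthread (illustrative; the exact fractions vary with the random
data):

-- true selectivity of each predicate, and their product assuming
-- independence (run against the test table from the first message):
select
  avg((key = 222)::int) as sel_key,
  avg((ts between '7/11/2009' and '7/12/2009')::int) as sel_ts,
  avg((key = 222)::int)
    * avg((ts between '7/11/2009' and '7/12/2009')::int) as product
from range;
-- with 500 keys and ~1000 days of ts, sel_key is ~0.002 and sel_ts is
-- ~0.001, so the product predicts ~20 of the 10M rows -- in line with
-- the 13 actually returned above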
IOW, I don't see this as a 'not enough statistics' problem, more of a
'looking at the statistics wrong for a multi-column index range
operation' problem.  Equality works correctly, as it always has.  This
is kind of a stats loophole introduced when we got the ability to
correctly do these types of operations in 8.2.
There's no workaround that I see to this problem short of disabling
enable_seqscan.  The classic form of this query when looking for only
one 'key' (my problem case):

select * from range where key = x and ts between a and b;

usually gives a plain index scan, which can be really undesirable.
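The classic rewrite only covers the single-key case, by the way.  When
the bounds span key values, as in this illustrative variant, there is
no simple per-column equivalent to fall back on:

-- bounds spanning key values can't be decomposed into independent
-- per-column conjuncts, so there's no classic-form rewrite here:
select * from range
where (key, ts) >= (222, '7/11/2009')
  and (key, ts) <= (230, '7/12/2009')
order by key, ts;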
merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> On Mon, Mar 2, 2009 at 8:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Merlin Moncure <mmoncure@gmail.com> writes:
>>> It looks like for row-wise comparison, only the first column is used
>>> for generating the expected row count.
>> [ shrug... ]  Short of multi-column statistics, it's hard to see how to
>> do better.
> hm... Why can't you just multiply the range estimates for the fields
> together when doing an operation over the key?
Because it would be the wrong answer, except in the uncommon case where
the field values are completely independent (at least, I would expect
that to be uncommon when people have multicolumn indexes on them).
In the case at hand, I think you're barking up the wrong tree anyway.
It's much more likely that what we need to do is fix
clauselist_selectivity to recognize range conditions involving
RowCompareExprs.
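To make that concrete: per SQL row-comparison semantics, the pair of
RowCompareExprs in the test case expands to the form below, and that
is what such an estimator would have to model.  Both counts should
match (a sanity check against the upthread test table, not planner
code):

-- (a, b) >= (x, y) means a > x OR (a = x AND b >= y); likewise for <=
select count(*) from range
where (key, ts) >= (222, '7/11/2009')
  and (key, ts) <= (222, '7/12/2009');

select count(*) from range
where (key > 222 or (key = 222 and ts >= '7/11/2009'))
  and (key < 222 or (key = 222 and ts <= '7/12/2009'));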
regards, tom lane
On Tue, Mar 3, 2009 at 3:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Because it would be the wrong answer, except in the uncommon case where
> the field values are completely independent (at least, I would expect
> that to be uncommon when people have multicolumn indexes on them).
Actually I think it's *more* likely to be true for multicolumn
indexes.  If the two columns were correlated then you wouldn't need
the multicolumn index, since you could just use one of the columns
alone.  The case where people create multicolumn btree indexes is
usually when they have a "major"-"minor" relationship between the two
columns, so for each value of the major key there is a whole range of
minor keys.
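Merlin's test table is exactly that shape: key is the major column and
ts the minor one.  A generic, purely hypothetical example of the
pattern:

-- hypothetical major/minor pair: each major key (account_id) carries
-- a whole range of minor keys (event_time), and the two columns are
-- largely independent of each other:
create table events (account_id int, event_time timestamptz);
create index events_idx on events (account_id, event_time);

-- the typical query pins the major key and ranges over the minor one:
select * from events
where account_id = 42
  and event_time between '2009-03-01' and '2009-03-08';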
Sure, multi-column statistics would be nice, though.
--
greg