Estimated rows question

Started by Sam Rossover 13 years ago5 messages

elliptic@gmail.com

over 13 years ago

I was wondering why it seems that the query planner can't "see", based
on the histograms, that two join-columns have a very small
intersection, and adjust its row estimation accordingly. Clearly the
below query returns 1001 rows. It appears as if much or all of the
necessary machinery exists in mergejoinscansel, and indeed if you
inspect
leftstartsel, leftendsel, rightstartsel, rightendsel during execution
they are respectively 0.98, 1.00, 0.00, 0.020, which I believe makes sense.

Am I missing something obvious?
Thanks
Sam

create table table_a as select * from generate_series(1,61000) as pkey;
create table table_b as select * from generate_series(60000,110000) as pkey;
create unique index idx_a on table_a(pkey);
create unique index idx_b on table_b(pkey);
analyse table_a;
analyse table_b;

explain select * from table_a a inner join table_b b on a.pkey = b.pkey;

QUERY PLAN
-----------------------------------------------------------------------------------------
Merge Join (cost=1984.88..2550.42 rows=50001 width=8)
Merge Cond: (a.pkey = b.pkey)
-> Index Only Scan using idx_a on table_a a (cost=0.00..1864.32
rows=61000 width=4)
-> Index Only Scan using idx_b on table_b b (cost=0.00..1531.32
rows=50001 width=4)

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Sam Ross (#1)

Re: Estimated rows question

[ sorry for slow response, but I'd not gotten time to think about this... ]

Sam Ross <elliptic@gmail.com> writes:

I was wondering why it seems that the query planner can't "see", based
on the histograms, that two join-columns have a very small
intersection, and adjust its row estimation accordingly.

The reason why not is that eqjoinsel() doesn't take any such
consideration into account. It's possible that it'd be a good idea
to teach it to do so. I'm not entirely convinced though. It would
add a fair amount of expense to that function, as well as adding
some possibly shaky assumptions, and I'm not sure how often we'd
get a usefully-better estimate in practice. OTOH, there are a lot
of shaky assumptions in eqjoinsel() already, and we did decide this
was worth worrying about in mergejoin cost estimation.

Do you want to try it and submit a patch for testing?

regards, tom lane

Sam Ross

elliptic@gmail.com

over 13 years ago

In reply to: Tom Lane (#2)

Re: [GENERAL] Estimated rows question

On Sat, Sep 1, 2012 at 8:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

[ sorry for slow response, but I'd not gotten time to think about this... ]

Sam Ross <elliptic@gmail.com> writes:

I was wondering why it seems that the query planner can't "see", based
on the histograms, that two join-columns have a very small
intersection, and adjust its row estimation accordingly.

The reason why not is that eqjoinsel() doesn't take any such
consideration into account. It's possible that it'd be a good idea
to teach it to do so. I'm not entirely convinced though. It would
add a fair amount of expense to that function, as well as adding
some possibly shaky assumptions, and I'm not sure how often we'd
get a usefully-better estimate in practice. OTOH, there are a lot
of shaky assumptions in eqjoinsel() already, and we did decide this
was worth worrying about in mergejoin cost estimation.

Do you want to try it and submit a patch for testing?

regards, tom lane

Thanks for the answer, and sorry for the slow reply -
I'm not sure I have the necessary expertise, but I'll be happy to give
it a shot. Is there an already-assembled library of queries that is
used to test purported improvements to the planner, or is it expected
that I come up with a convincing test-set myself?
Sam

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Sam Ross (#3)

Re: [GENERAL] Estimated rows question

Sam Ross <elliptic@gmail.com> writes:

On Sat, Sep 1, 2012 at 8:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Do you want to try it and submit a patch for testing?

Thanks for the answer, and sorry for the slow reply -
I'm not sure I have the necessary expertise, but I'll be happy to give
it a shot. Is there an already-assembled library of queries that is
used to test purported improvements to the planner, or is it expected
that I come up with a convincing test-set myself?

No, we don't really have much in that line :-(. There's the regression
tests of course, but they're mostly about functionality not quality of
statistical estimates. In practice, as long as a proposed change made
the estimates demonstrably better in some reasonable scenarios, and
didn't slow it down very much, that would probably be enough.

regards, tom lane

David Fetter

david@fetter.org

over 13 years ago

In reply to: Tom Lane (#4)

Re: [GENERAL] Estimated rows question

On Wed, Sep 12, 2012 at 07:19:06PM -0400, Tom Lane wrote:

Sam Ross <elliptic@gmail.com> writes:

On Sat, Sep 1, 2012 at 8:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Do you want to try it and submit a patch for testing?

Thanks for the answer, and sorry for the slow reply - I'm not sure
I have the necessary expertise, but I'll be happy to give it a
shot. Is there an already-assembled library of queries that is
used to test purported improvements to the planner, or is it
expected that I come up with a convincing test-set myself?

No, we don't really have much in that line :-(.

Does anyone? If so, who?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate