Planner question

Started by Tom Raneyover 17 years ago7 messages
#1Tom Raney
raneyt@cecs.pdx.edu

Why does the planner consider both input variations of each symmetric merge join? The README says "there is not a lot of difference" between the two options. When are there any differences?

-Tom Raney

#2Jeff Davis
pgsql@j-davis.com
In reply to: Tom Raney (#1)
Re: Planner question

On Fri, 2008-09-05 at 11:21 -0700, Tom Raney wrote:

Why does the planner consider both input variations of each symmetric merge join? The README says "there is not a lot of difference" between the two options. When are there any differences?

-Tom Raney

http://archives.postgresql.org/pgsql-general/2008-08/msg00967.php

My understanding from that thread is that if one table has high
ndistinct and the other has low ndistinct, one plan may require more
re-scanning than the other.

Regards,
Jeff Davis

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Raney (#1)
Re: Planner question

Tom Raney <raneyt@cecs.pdx.edu> writes:

Why does the planner consider both input variations of each symmetric merge join? The README says "there is not a lot of difference" between the two options. When are there any differences?

The righthand side needs to support mark/restore, the left doesn't;
so depending on plan types one way might need a helper Materialize
node that the other way doesn't. Also, duplicated values are a bit
cheaper to process on the left than the right.

regards, tom lane

#4Tom Raney
raneyt@cecs.pdx.edu
In reply to: Tom Lane (#3)
Re: Planner question

Tom Lane wrote:

Tom Raney <raneyt@cecs.pdx.edu> writes:

Why does the planner consider both input variations of each symmetric merge join? The README says "there is not a lot of difference" between the two options. When are there any differences?

The righthand side needs to support mark/restore, the left doesn't;
so depending on plan types one way might need a helper Materialize
node that the other way doesn't. Also, duplicated values are a bit
cheaper to process on the left than the right.

regards, tom lane

Thank you for the explanation.

On a somewhat related issue, I am a bit stumped on the way path keys
function.

In the following query and debug data, why would an index scan on a
single relation contain a path key from a different relation?

optimizer/README says, "The PathKeys data structure represents what is
known about the sort order of the tuples generated by a particular
Path. A path's pathkeys field is a list of PathKey nodes, where the
n'th item represents the n'th sort key of the result."

Why does the index scan for tenk1 include a path key from
onek.unique2? Is it implying an equivalence there?

-Tom Raney

bench=# explain select * from tenk1 JOIN onek ON
tenk1.unique2=onek.unique2;

RELOPTINFO (tenk1): rows=10000 width=244
path list:
SeqScan(tenk1) rows=10000 cost=0.00..434.00
IdxScan(tenk1) rows=10000 cost=0.00..583.25
pathkeys: ((tenk1.unique2, onek.unique2)) <---

cheapest startup path:
SeqScan(tenk1) rows=10000 cost=0.00..434.00

cheapest total path:
SeqScan(tenk1) rows=10000 cost=0.00..434.00

RELOPTINFO (onek): rows=1000 width=244
path list:
SeqScan(onek) rows=1000 cost=0.00..44.00
IdxScan(onek) rows=1000 cost=0.00..72.25
pathkeys: ((tenk1.unique2, onek.unique2))

cheapest startup path:
SeqScan(onek) rows=1000 cost=0.00..44.00

cheapest total path:
SeqScan(onek) rows=1000 cost=0.00..44.00

RELOPTINFO (tenk1/onek): rows=1000 width=488
path list:
MergeJoin(tenk1/onek) rows=1000 cost=0.52..144.24
clauses: tenk1.unique2 = onek.unique2
IdxScan(tenk1) rows=10000 cost=0.00..583.25
pathkeys: ((tenk1.unique2, onek.unique2))
IdxScan(onek) rows=1000 cost=0.00..72.25
pathkeys: ((tenk1.unique2, onek.unique2))
NestLoop(tenk1/onek) rows=1000 cost=0.00..1756.96
clauses: tenk1.unique2 = onek.unique2
SeqScan(onek) rows=1000 cost=0.00..44.00
IdxScan(tenk1) rows=10000 cost=0.00..1.70

cheapest startup path:
NestLoop(tenk1/onek) rows=1000 cost=0.00..1756.96
clauses: tenk1.unique2 = onek.unique2
SeqScan(onek) rows=1000 cost=0.00..44.00
IdxScan(tenk1) rows=10000 cost=0.00..1.70

cheapest total path:
MergeJoin(tenk1/onek) rows=1000 cost=0.52..144.24
clauses: tenk1.unique2 = onek.unique2
IdxScan(tenk1) rows=10000 cost=0.00..583.25
pathkeys: ((tenk1.unique2, onek.unique2))
IdxScan(onek) rows=1000 cost=0.00..72.25
pathkeys: ((tenk1.unique2, onek.unique2))

QUERY PLAN
-----------------------------------------------------------------------------------------
Merge Join (cost=0.52..144.24 rows=1000 width=488)
Merge Cond: (tenk1.unique2 = onek.unique2)
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.00..583.25
rows=10000 width=244)
-> Index Scan using onek_unique2 on onek (cost=0.00..72.25 rows=1000
width=244)
(4 rows)

bench=#

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Raney (#4)
Re: Planner question

Tom Raney <raneyt@cecs.pdx.edu> writes:

Why does the index scan for tenk1 include a path key from
onek.unique2? Is it implying an equivalence there?

bench=# explain select * from tenk1 JOIN onek ON
tenk1.unique2=onek.unique2;

Yes, for an example like that the planner knows that tenk1.unique2 and
onek.unique2 will have equal values in any valid join row, so it's okay
to suppose that a sort by one is the same as a sort by the other. So
the pathkey items actually reference sets of variables
{tenk1.unique2, onek.unique2}
not just individual variables.

RELOPTINFO (tenk1): rows=10000 width=244
path list:
SeqScan(tenk1) rows=10000 cost=0.00..434.00
IdxScan(tenk1) rows=10000 cost=0.00..583.25
pathkeys: ((tenk1.unique2, onek.unique2)) <---

cheapest startup path:
SeqScan(tenk1) rows=10000 cost=0.00..434.00

cheapest total path:
SeqScan(tenk1) rows=10000 cost=0.00..434.00

Hm, I don't recognize this output format, is it coming from some custom
code?

regards, tom lane

#6Tom Raney
raneyt@cecs.pdx.edu
In reply to: Tom Lane (#5)
Re: Planner question

Tom Lane wrote:

Tom Raney <raneyt@cecs.pdx.edu> writes:

Why does the index scan for tenk1 include a path key from
onek.unique2? Is it implying an equivalence there?

bench=# explain select * from tenk1 JOIN onek ON
tenk1.unique2=onek.unique2;

Yes, for an example like that the planner knows that tenk1.unique2 and
onek.unique2 will have equal values in any valid join row, so it's okay
to suppose that a sort by one is the same as a sort by the other. So
the pathkey items actually reference sets of variables
{tenk1.unique2, onek.unique2}
not just individual variables.

Thanks.

RELOPTINFO (tenk1): rows=10000 width=244
path list:
SeqScan(tenk1) rows=10000 cost=0.00..434.00
IdxScan(tenk1) rows=10000 cost=0.00..583.25
pathkeys: ((tenk1.unique2, onek.unique2)) <---

cheapest startup path:
SeqScan(tenk1) rows=10000 cost=0.00..434.00

cheapest total path:
SeqScan(tenk1) rows=10000 cost=0.00..434.00

Hm, I don't recognize this output format, is it coming from some custom
code?

Yes, it is. I thought it was easier to read the OPTIMIZER_DEBUG
output with the relation names instead of the relation indexes. I
will post a patch against CVS HEAD if you think it will help others.

-Tom

#7Bruce Momjian
bruce@momjian.us
In reply to: Tom Raney (#6)
Re: Planner question

Tom Raney wrote:

RELOPTINFO (tenk1): rows=10000 width=244
path list:
SeqScan(tenk1) rows=10000 cost=0.00..434.00
IdxScan(tenk1) rows=10000 cost=0.00..583.25
pathkeys: ((tenk1.unique2, onek.unique2)) <---

cheapest startup path:
SeqScan(tenk1) rows=10000 cost=0.00..434.00

cheapest total path:
SeqScan(tenk1) rows=10000 cost=0.00..434.00

Hm, I don't recognize this output format, is it coming from some custom
code?

Yes, it is. I thought it was easier to read the OPTIMIZER_DEBUG
output with the relation names instead of the relation indexes. I
will post a patch against CVS HEAD if you think it will help others.

Personally I would like to see optimizer debug become a configuration
parameter rather than a compile-time parameter.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +