totally different plan when using partitions
query using partitions explicitly ("1"):
explain analyze
select nome1,
thv3tralacc,
dltbfpgpdch
FROM cell_bsc_60_0610 as cell_bsc
left outer join teststscell73_0610_1 as data on data.ne_id=cell_bsc.nome1
left outer join teststscell13_0610_1 as data1 on data1.ne_id=cell_bsc.nome1 and data1.time=data.time
where
data.time >=cell_bsc.starttime and data.time <=cell_bsc.endtime and
data.time between '2006-10-01 00:00:00' and '2006-10-06 00:00:00'
and data1.time >=cell_bsc.starttime and data1.time <=cell_bsc.endtime
and data1.time between '2006-10-01 00:00:00' and '2006-10-06 00:00:00'
and cell_bsc.nome2=2
explain analyze:
http://explain-analyze.info/query_plans/3805-query-plan-2509
same query, but using postgresql's partition pruning ("2"):
explain analyze
select nome1,
thv3tralacc,
dltbfpgpdch
FROM cell_bsc_60_0610 as cell_bsc
left outer join teststscell73 as data on data.ne_id=cell_bsc.nome1
left outer join teststscell13 as data1 on data1.ne_id=cell_bsc.nome1 and data1.time=data.time
where
data.time >=cell_bsc.starttime and data.time <=cell_bsc.endtime and
data.time between '2006-10-01 00:00:00' and '2006-10-06 00:00:00'
and data1.time >=cell_bsc.starttime and data1.time <=cell_bsc.endtime
and data1.time between '2006-10-01 00:00:00' and '2006-10-06 00:00:00'
and cell_bsc.nome2=2
explain analyze:
http://explain-analyze.info/query_plans/3807-query-plan-2511
The second version is A LOT slower (10x). But the 2 queries should be identical... why the two totally different plans???
As you can see in query "1" I just put the used table, in query "2" postgres uses exactly the table I put in "1" (plus the empty tables that are the "father" of the other tables); so I don't understand why the 2 plans...
Anyone??? This looks like a bug to me... or is there an explanation?
--- Mer 12/8/09, Scara Maccai <m_lists@yahoo.it> ha scritto:
Show quoted text
Da: Scara Maccai <m_lists@yahoo.it>
Oggetto: [GENERAL] totally different plan when using partitions
A: "pgsql-general" <pgsql-general@postgresql.org>
Data: Mercoledì 12 agosto 2009, 13:05query using partitions explicitly ("1"):
explain analyze
select nome1,
thv3tralacc,
dltbfpgpdch
FROM cell_bsc_60_0610 as cell_bsc
left outer join
teststscell73_0610_1 as data on data.ne_id=cell_bsc.nome1
left outer join
teststscell13_0610_1 as data1 on data1.ne_id=cell_bsc.nome1
and data1.time=data.time
wheredata.time >=cell_bsc.starttime and
data.time <=cell_bsc.endtime and
data.time between '2006-10-01
00:00:00' and '2006-10-06 00:00:00'
and data1.time >=cell_bsc.starttime
and data1.time <=cell_bsc.endtime
and
data1.time between '2006-10-01 00:00:00' and '2006-10-06
00:00:00'
and cell_bsc.nome2=2explain analyze:
http://explain-analyze.info/query_plans/3805-query-plan-2509
same query, but using postgresql's partition pruning
("2"):explain analyze
select nome1,
thv3tralacc,
dltbfpgpdch
FROM cell_bsc_60_0610 as cell_bsc
left outer join
teststscell73 as data on data.ne_id=cell_bsc.nome1
left outer join
teststscell13 as data1 on data1.ne_id=cell_bsc.nome1 and
data1.time=data.time
wheredata.time >=cell_bsc.starttime and
data.time <=cell_bsc.endtime and
data.time between '2006-10-01
00:00:00' and '2006-10-06 00:00:00'
and data1.time >=cell_bsc.starttime
and data1.time <=cell_bsc.endtime
and
data1.time between '2006-10-01 00:00:00' and '2006-10-06
00:00:00'
and cell_bsc.nome2=2explain analyze:
http://explain-analyze.info/query_plans/3807-query-plan-2511
The second version is A LOT slower (10x). But the 2 queries
should be identical... why the two totally different
plans???As you can see in query "1" I just put the used table, in
query "2" postgres uses exactly the table I put in "1" (plus
the empty tables that are the "father" of the other tables);
so I don't understand why the 2 plans...
Scara Maccai wrote:
same query, but using postgresql's partition pruning ("2"):
explain analyze
select nome1,
thv3tralacc,
dltbfpgpdch
FROM cell_bsc_60_0610 as cell_bsc
left outer join teststscell73 as data on data.ne_id=cell_bsc.nome1
left outer join teststscell13 as data1 on data1.ne_id=cell_bsc.nome1 and data1.time=data.time
wheredata.time >=cell_bsc.starttime and data.time <=cell_bsc.endtime and
data.time between '2006-10-01 00:00:00' and '2006-10-06 00:00:00'
and data1.time >=cell_bsc.starttime and data1.time <=cell_bsc.endtime
and data1.time between '2006-10-01 00:00:00' and '2006-10-06 00:00:00'
and cell_bsc.nome2=2explain analyze:
http://explain-analyze.info/query_plans/3807-query-plan-2511
The second version is A LOT slower (10x). But the 2 queries should be identical... why the two totally different plans???
Well, the first version was expecting about 400 rows, the second 15000,
so it's not surprising that they have different plans.
I'm not sure whether the planner is smart enough to cope with the
multiple tests on time vs the partitioning and realise it can use your
index on the partition.
I'm assuming the partition is defined as being "between '2006-10-01
00:00:00' and '2006-10-06 00:00:00'" - strikes me as an odd interval,
and you don't say anywhere.
You _might_ have a better plan by moving the partitioned tests into
subqueries:
FROM cell_bsc_60_0610 as cell_bsc
left outer join (
SELECT ne_id FROM teststscell73 WHEREtime between '2006-10-01
00:00:00' and '2006-10-06 00:00:00') as data on data.ne_id=cell_bsc.nome1
--
Richard Huxton
Archonet Ltd
Scara Maccai wrote:
Thank you for your reply. This makes partitions unusable for me...
hope someone explains why this happens... this still looks like a bug
to me... BTW the problem arises when adding the second "left outer
join": when using only 1 partitioned table (that is, only 1 "left
outer join") the 2 plans are identical.My answers follow.
Well, the first version was expecting about 400 rows, the second
15000, so it's not surprising that they have different plans.Well, they're using exactly the same tables; I don't understand why
one expects 400 rows and the other 15000....
Well let's see. It's difficult to be certain without the table
definitions and sizes, but...
Both sequentially scan "cell_bsc_60_0610" because it will return < 300
rows. Seems sensible, and it gets the row estimate right.
The fast one then uses the index "teststscell73_0610_1_pkey" to probe
for matches and then again via "teststscell13_0610_1_pkey". Two nested
loops only make sense where you have a small number of rows matching,
otherwise it can get expensive going back and fore to the index and
table all the time. In this case, we end up matching more rows than we
expected (rows=60 loops=285 gives us ~17100 rows to check against the
second index) but not enough to slow us down.
That first plan estimated a cost of 33391 for its 408 rows and the
second 70402 for 15982 rows. That's double the estimated cost, but it
takes four times longer to complete. If the first query had estimated
the number of rows correctly it would have *looked* more expensive than
the second. So - with your particular setup PostgreSQL thinks it takes
longer to do index searches than it really does (or perhaps it thinks
sorting is quicker, or both).
You might want to adjust your cost estimates (see ch 18.6 of the
manuals). Start by checking effective_cache_size. Then maybe reduce
random_page_cost. Small steps though - these settings will affect all
your other queries too.
Also if you have the time, try issuing "set enable_mergejoin=off" and
re-issue query #2. That will force it to choose another plan.
Oh - if the partitions hold historical (unchanging) data it might be
worth CLUSTERing them on the pkey index too.
I'm not sure whether the planner is smart enough to cope with the
multiple tests on time vs the partitioning and realise it can use
your index on the partition.Sorry, didn't understand that...
The partitioning relies on proving that only partition P needs to be
considered. The more complicated the situation the harder it is to do
that. In this case, I'm not sure that's whats happening though - it
seems to narrow the partition down well enough.
I'm assuming the partition is defined as being "between '2006-10-01
00:00:00' and '2006-10-06 00:00:00'" - strikes me as an odd
interval, and you don't say anywhere.Data is partitioned on a "4 tables per month" basis
But that's from the start of the 1st to the start of the 6th - five full
days rather than 7+. It also *includes* the start of the 6th (<= not <)
which suggests overlap on the partitions. What does the definition of
your partition say *exactly*?
--
Richard Huxton
Archonet Ltd
Import Notes
Reply to msg id not found: 292963.52054.qm@web24601.mail.ird.yahoo.comReference msg id not found: 292963.52054.qm@web24601.mail.ird.yahoo.com | Resolved by subject fallback
I'm still looking into it, but it seems the difference in the 2 plans is due to the fact that when using partitions, the planner adds the time it would take to index-scan the empty "root" table.
But that table will never contain any data...
Is there any chance to have the partitioning mechanism know that a table will always contain no data, because only inheriting table will contain data?
Having the planner line:
-> Index Scan using teststscell13_pkey on teststscell13 data1 (cost=0.0..3.9 rows=1 width=16) (actual time=0.006..0.006 rows=0 loops=285)
doesn't make any sense: that table will never have any data.
I'd like to have a way to tell that to Postgresql...
Something like:
CREATE TABLE tabroot
(...) WITH (NODATA)
So that it will stop scanning the empty table every single loop...
And every time you try to insert directly into tabroot you get an error...
Scara Maccai wrote:
explain analyze
select nome1,
thv3tralacc,
dltbfpgpdch
FROM cell_bsc_60_0610 as cell_bsc
left outer join teststscell73_0610_1 as data on data.ne_id=cell_bsc.nome1
left outer join teststscell13_0610_1 as data1 on data1.ne_id=cell_bsc.nome1 and data1.time=data.time
wheredata.time >=cell_bsc.starttime and data.time <=cell_bsc.endtime and
data.time between '2006-10-01 00:00:00' and '2006-10-06 00:00:00'
and data1.time >=cell_bsc.starttime and data1.time <=cell_bsc.endtime
and data1.time between '2006-10-01 00:00:00' and '2006-10-06 00:00:00'
and cell_bsc.nome2=2explain analyze:
http://explain-analyze.info/query_plans/3805-query-plan-2509
same query, but using postgresql's partition pruning ("2"):
explain analyze
select nome1,
thv3tralacc,
dltbfpgpdch
FROM cell_bsc_60_0610 as cell_bsc
left outer join teststscell73 as data on data.ne_id=cell_bsc.nome1
left outer join teststscell13 as data1 on data1.ne_id=cell_bsc.nome1 and data1.time=data.time
wheredata.time >=cell_bsc.starttime and data.time <=cell_bsc.endtime and
data.time between '2006-10-01 00:00:00' and '2006-10-06 00:00:00'
and data1.time >=cell_bsc.starttime and data1.time <=cell_bsc.endtime
and data1.time between '2006-10-01 00:00:00' and '2006-10-06 00:00:00'
and cell_bsc.nome2=2
Huh, clearly not the same query (you're using the partition directly in
the first query) ... Doing two changes at once is not helping your
case.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Huh, clearly not the same query (you're using the partition
directly in
the first query) ... Doing two changes at once is not
helping your
case.
Sorry, I don't understand... of course I used the partition directly in the first query... it's the difference between the two... what I don't like is that since the tables used are in fact the same, the plan shouldn't be that different.
My conclusion is that the planner thinks there could be some data in the "root" partition, even if that will always be empty.
What I would like is a way to tell Postgres "hey, don't even look at the root table. That's just a placeholder for the partitions. It will never contain any data" when I create the tables.
Otherwise the planner might get fooled by an empty table index scan in a loop (which is what happens here), thinking that that will take time.
Scara Maccai wrote:
I'm still looking into it, but it seems the difference in the 2 plans is due to the fact that when using partitions, the planner adds the time it would take to index-scan the empty "root" table.
But that table will never contain any data...Is there any chance to have the partitioning mechanism know that a table will always contain no data, because only inheriting table will contain data?
Having the planner line:
-> Index Scan using teststscell13_pkey on teststscell13 data1 (cost=0.0..3.9 rows=1 width=16) (actual time=0.006..0.006 rows=0 loops=285)doesn't make any sense: that table will never have any data.
I'd like to have a way to tell that to Postgresql...
It's one index probe and takes virtually no time at all. That's not your
problem.
--
Richard Huxton
Archonet Ltd
-> Index Scan using
teststscell13_pkey on teststscell13 data1 (cost=0.0..3.9
rows=1 width=16) (actual time=0.006..0.006 rows=0
loops=285)doesn't make any sense: that table will never have any
data.
I'd like to have a way to tell that to Postgresql...
It's one index probe and takes virtually no time at all.
That's not your problem.
Put that in a 60000 nested loop and it won't be "virtually no time at all" I'm afraid... to the planner that "3.9 cost" almost the same as an index scan on a populated table...
Hence the planner uses a different plan.
Otherwise I don't see why the 2 plans should be different...
Scara Maccai wrote:
Huh, clearly not the same query (you're using the partition directly
in the first query) ...� Doing two changes at once is not helping
your case.Sorry, I don't understand... of course I used the partition directly
in the first query... it's the difference between the two... what I
don't like is that since the tables used are in fact the same, the
plan shouldn't be that different.
I misread your original email to say that you were changing the
parameter. What version are you using? Also, please post the table
definitions (preferably in pg_dump -s format)
My conclusion is that the planner thinks there could be some data in
the "root" partition, even if that will always be empty.
What I would like is a way to tell Postgres "hey, don't even look at
the root table. That's just a placeholder for the partitions. It will
never contain any data" when I create the tables.Otherwise the planner might get fooled by an empty table index scan in
a loop (which is what happens here), thinking that that will take
time.
I'm not sure I agree with your assessment of the problem.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
What version are you using? Also,
please post the table
definitions (preferably in pg_dump -s format)
Table definition at the end of the msg.
Postgresql 8.4beta1
I'm not sure I agree with your assessment of the problem.
This is why I think that's the problem:
This is an explain of the query using
set enable_mergejoin=off;
set enable_hashjoin=off
http://explain-analyze.info/query_plans/3817-query-plan-2525
As you can see, the 2 root partition roots (teststscell73 and teststscell13) take
teststscell73: 3.90 * 30120 loops = 117468 cost
teststscell13: 3.89 * 15964 loops = 62099 cost
total: 179567 cost out of 377398 total cost of the query...
basically the 2 empty tables index access take 1/2 of the query planned time... while they should take 0, since they're empty!!!
Since I can't tell postgresql they're empty, it assumes they have to be accessed...
As I said, when using partitioning, I would like the option of flagging some tables (what I call "the root tables") as "Always empty", so that the planner wouldn't care of them...
CREATE TABLE cell_bsc_60_0610 (
id integer NOT NULL,
nome1 integer,
nome2 integer,
starttime timestamp without time zone,
endtime timestamp without time zone
);
CREATE TABLE teststscell13 (
"time" timestamp without time zone NOT NULL,
ne_id integer NOT NULL,
void0 integer,
void1 integer,
void2 integer,
id1 integer,
[....]
mutil33 integer,
mutil12 integer
);
CREATE TABLE teststscell13_0610_1 (CONSTRAINT teststscell13_0610_1_time_check CHECK ((("time" >= '2006-10-01 00:00:00'::timestamp without time zone) AND ("time" < '2006-10-09 00:00:00'::timestamp without time zone)))
)
INHERITS (teststscell13);
CREATE TABLE teststscell13_0610_2 (CONSTRAINT teststscell13_0610_2_time_check CHECK ((("time" >= '2006-10-09 00:00:00'::timestamp without time zone) AND ("time" < '2006-10-16 00:00:00'::timestamp without time zone)))
)
INHERITS (teststscell13);
CREATE TABLE teststscell13_0610_3 (CONSTRAINT teststscell13_0610_3_time_check CHECK ((("time" >= '2006-10-16 00:00:00'::timestamp without time zone) AND ("time" < '2006-10-24 00:00:00'::timestamp without time zone)))
)
INHERITS (teststscell13);
CREATE TABLE teststscell13_0610_4 (CONSTRAINT teststscell13_0610_4_time_check CHECK ((("time" >= '2006-10-24 00:00:00'::timestamp without time zone) AND ("time" < '2006-11-01 00:00:00'::timestamp without time zone)))
)
INHERITS (teststscell13);
CREATE TABLE teststscell73 (
"time" timestamp without time zone NOT NULL,
ne_id integer NOT NULL,
mutil22 integer,
traffdlgprsscan integer,
dlbpdch integer,
dlgpdch integer,
dlepdch integer,
dltbfpbpdch integer,
[...]
void504 integer,
void505 integer,
void506 integer,
void507 integer,
void508 integer,
void509 integer,
void510 integer,
void511 integer
);
CREATE TABLE teststscell73_0610_1 (CONSTRAINT teststscell73_0610_1_time_check CHECK ((("time" >= '2006-10-01 00:00:00'::timestamp without time zone) AND ("time" < '2006-10-09 00:00:00'::timestamp without time zone)))
)
INHERITS (teststscell73);
CREATE TABLE teststscell73_0610_2 (CONSTRAINT teststscell73_0610_2_time_check CHECK ((("time" >= '2006-10-09 00:00:00'::timestamp without time zone) AND ("time" < '2006-10-16 00:00:00'::timestamp without time zone)))
)
INHERITS (teststscell73);
CREATE TABLE teststscell73_0610_3 (CONSTRAINT teststscell73_0610_3_time_check CHECK ((("time" >= '2006-10-16 00:00:00'::timestamp without time zone) AND ("time" < '2006-10-24 00:00:00'::timestamp without time zone)))
)
INHERITS (teststscell73);
CREATE TABLE teststscell73_0610_4 (CONSTRAINT teststscell73_0610_4_time_check CHECK ((("time" >= '2006-10-24 00:00:00'::timestamp without time zone) AND ("time" < '2006-11-01 00:00:00'::timestamp without time zone)))
)
INHERITS (teststscell73);
ALTER TABLE ONLY teststscell13_0610_1
ADD CONSTRAINT teststscell13_0610_1_pkey PRIMARY KEY (ne_id, "time");
ALTER TABLE ONLY teststscell13_0610_2
ADD CONSTRAINT teststscell13_0610_2_pkey PRIMARY KEY (ne_id, "time");
ALTER TABLE ONLY teststscell13_0610_3
ADD CONSTRAINT teststscell13_0610_3_pkey PRIMARY KEY (ne_id, "time");
ALTER TABLE ONLY teststscell13_0610_4
ADD CONSTRAINT teststscell13_0610_4_pkey PRIMARY KEY (ne_id, "time");
ALTER TABLE ONLY teststscell13
ADD CONSTRAINT teststscell13_pkey PRIMARY KEY (ne_id, "time");
ALTER TABLE ONLY teststscell73_0610_1
ADD CONSTRAINT teststscell73_0610_1_pkey PRIMARY KEY (ne_id, "time");
ALTER TABLE ONLY teststscell73_0610_2
ADD CONSTRAINT teststscell73_0610_2_pkey PRIMARY KEY (ne_id, "time");
ALTER TABLE ONLY teststscell73_0610_3
ADD CONSTRAINT teststscell73_0610_3_pkey PRIMARY KEY (ne_id, "time");
ALTER TABLE ONLY teststscell73_0610_4
ADD CONSTRAINT teststscell73_0610_4_pkey PRIMARY KEY (ne_id, "time");
ALTER TABLE ONLY teststscell73
ADD CONSTRAINT teststscell73_pkey PRIMARY KEY (ne_id, "time");
CREATE INDEX cell_bsc_60_idx ON cell_bsc_60_0610 USING btree (nome2, nome1);
Scara Maccai wrote:
set enable_mergejoin=off;
set enable_hashjoin=offhttp://explain-analyze.info/query_plans/3817-query-plan-2525
Ah, good - that's useful.
As you can see, the 2 root partition roots (teststscell73 and teststscell13) take
teststscell73: 3.90 * 30120 loops = 117468 cost
teststscell13: 3.89 * 15964 loops = 62099 costtotal: 179567 cost out of 377398 total cost of the query...
Your original "slow" query was only estimated at a cost of 70000 - it's
still going to be preferred even if you do get these to zero. Once the
cost estimates bear more of a relation to run-times things might improve.
--
Richard Huxton
Archonet Ltd
Richard Huxton <dev@archonet.com> writes:
Scara Maccai wrote:
http://explain-analyze.info/query_plans/3817-query-plan-2525
Ah, good - that's useful.
Hmmm ... why is the inner Nested Loop estimated to produce 30120 rows,
when the cartesian product of its inputs would only be 285 * 14 = 3990
rows? What PG version is this, and can you extract a self-contained
test case with an equally silly estimate?
regards, tom lane
Hmmm ... why is the inner Nested Loop estimated to produce
30120 rows,
when the cartesian product of its inputs would only be 285
* 14 = 3990
rows?
Oh my... didn't notice it!!!
What PG version is this
That was 8.4 beta1; now tried on
select version()
"PostgreSQL 8.4.0 on sparc-sun-solaris2.10, compiled by cc: Sun C 5.9 SunOS_sparc Patch 124867-02 2007/11/27, 64-bit"
And got same thing:
http://explain-analyze.info/query_plans/3820-query-plan-2526
, and can you extract a
self-contained
test case with an equally silly estimate?
mmh... I'll try to "squeeze" the data I'm using and see if it still gets that plan.
How big a test case could be? Could I post it (ftp?) somewhere? (you can email me about this in private if you want)
Scara Maccai wrote:
Hmmm ... why is the inner Nested Loop estimated to produce
30120 rows,
when the cartesian product of its inputs would only be 285
* 14 = 3990
rows?Oh my... didn't notice it!!!
It was doing the same thing here too:
http://explain-analyze.info/query_plans/3807-query-plan-2511
Basically the same location too: nested loop over a seq-scan and an append.
--
Richard Huxton
Archonet Ltd
Query:
set enable_mergejoin=off;set enable_hashjoin=off;
explain analyze
select nome1,
thv3tralacc,
dltbfpgpdch
FROM cell_bsc_60_0610 as cell_bsc
left outer join teststscell73_test as data on data.ne_id=cell_bsc.nome1
left outer join teststscell13_test as data1 on data1.ne_id=cell_bsc.nome1 and data1.time=data.time
where
data.time >=cell_bsc.starttime and data.time <=cell_bsc.endtime and
data.time between '2006-10-01 00:00:00' and '2006-10-02 01:00:00'
and data1.time >=cell_bsc.starttime and data1.time <=cell_bsc.endtime
and data1.time between '2006-10-01 00:00:00' and '2006-10-02 01:00:00'
and cell_bsc.nome2=2;
Explain analyze on my sistem ("PostgreSQL 8.4.0 on sparc-sun-solaris2.10, compiled by cc: Sun C 5.9 SunOS_sparc Patch 124867-02 2007/11/27, 64-bit"):
http://explain-analyze.info/query_plans/3822-query-plan-2528
interesting bit (if I got it right):
-> Nested Loop (cost=0.0..3139.65 rows=6531 width=36) (actual time=0.141..13.459 rows=3692 loops=1)
Join Filter: ((data1.time >= cell_bsc.starttime) AND (data1.time <= cell_bsc.endtime) AND (cell_bsc.nome1 = data1.ne_id))
-> Seq Scan on cell_bsc_60_0610 cell_bsc (cost=0.0..99.39 rows=285 width=20) (actual time=0.033..1.740 rows=285 loops=1)
Filter: (nome2 = 2)
-> Append (cost=0.0..10.6 rows=4 width=16) (actual time=0.012..0.027 rows=13 loops=285)
285*4 should give 1140 max, not 6531: is this different enough?
If it is, I have a 585K dump file that should recreate the problem.