Query planner question
Hi Everyone,
I have this query :
select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and t.ticket_number = tb.ticket_number
and tb.parse_date > ('2014-07-31');
Execution plan: http://explain.depesz.com/s/YAak
Indexes on ticket :
"ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
"ticket_by_latest_transmission" btree (latest_transmission_id)
"ticket_by_ticket_number" btree (ticket_number)
This query only returns some portions of rows from ticket table.
The question is, Why does postgres need to get all the rows from ticket
table in order to complete this query?
Can't postgres use indexes to get only needed rows on ticket table?
I try set seqscan to off, but still index scan try to get all rows on
ticket table.
Here's the execution plan : http://explain.depesz.com/s/abH2
Thanks
--
Regards,
Soni Maula Harriz
Soni M wrote
Hi Everyone,
I have this query :
select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and t.ticket_number = tb.ticket_number
and tb.parse_date > ('2014-07-31');Execution plan: http://explain.depesz.com/s/YAak
Indexes on ticket :
"ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
"ticket_by_latest_transmission" btree (latest_transmission_id)
"ticket_by_ticket_number" btree (ticket_number)This query only returns some portions of rows from ticket table.
The question is, Why does postgres need to get all the rows from ticket
table in order to complete this query?
Can't postgres use indexes to get only needed rows on ticket table?I try set seqscan to off, but still index scan try to get all rows on
ticket table.
Here's the execution plan : http://explain.depesz.com/s/abH2
Short answer: you haven't defined "(latest_transmission_id, ticket_number)"
as being a foreign key onto the transmission_base table yet you seem to want
it to act like one.
Because of this failure the planner considers the following:
Nested Looping over 380,000 records is going to suck so it tries some
advanced "merge/join" techniques to try and speed things up. In any such
alternative the entire ticket table needs to be considered since there is no
constraint provided for that table - the only constraint in on
transmission_base and it rightly is using an index to find records matching
the where clause.
Since ticket_number and latest_transmission_id are found in separate indexes
I do not believe the planner can make use of an Index Only scan to fulfill
the join so each index lookup would require a corresponding heap lookup
which means extra work compared to just sequentially scanning the heap in
the first place. Since it is going to hit the entire thing in either case
the sequential scan is the logical choice for it to make.
Others will correct any factual mistakes I may have made - I am theorizing
here and do not understand the planner sufficient well to be 100% certain
that an FK definition will solve the problem.
David J.
--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-planner-question-tp5815659p5815661.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Thu, Aug 21, 2014 at 9:26 AM, David G Johnston <
david.g.johnston@gmail.com> wrote:
Soni M wrote
Hi Everyone,
I have this query :
select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and t.ticket_number = tb.ticket_number
and tb.parse_date > ('2014-07-31');Execution plan: http://explain.depesz.com/s/YAak
Indexes on ticket :
"ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
"ticket_by_latest_transmission" btree (latest_transmission_id)
"ticket_by_ticket_number" btree (ticket_number)This query only returns some portions of rows from ticket table.
The question is, Why does postgres need to get all the rows from ticket
table in order to complete this query?
Can't postgres use indexes to get only needed rows on ticket table?I try set seqscan to off, but still index scan try to get all rows on
ticket table.
Here's the execution plan : http://explain.depesz.com/s/abH2Short answer: you haven't defined "(latest_transmission_id, ticket_number)"
as being a foreign key onto the transmission_base table yet you seem to
want
it to act like one.
Currently we have only latest_transmission_id as FK, described here :
TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
(latest_transmission_id) REFERENCES transmission_base(transmission_id)
Change the query to include only FK still result the same:
explain select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and tb.parse_date > ('2014-07-31');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=113928.06..2583606.96 rows=200338 width=8)
Hash Cond: (t.latest_transmission_id = tb.transmission_id)
-> Seq Scan on ticket t (cost=0.00..1767767.26 rows=69990826 width=8)
-> Hash (cost=108923.38..108923.38 rows=400374 width=4)
-> Index Scan using transmission_base_by_parse_date on
transmission_base tb (cost=0.00..108923.38 rows=400374 width=4)
Index Cond: (parse_date > '2014-07-31 00:00:00'::timestamp
without time zone)
(6 rows)
I've googling this one, it seems that's how hash join works. For hash join
operation, the join predicate cannot be used for the index scan, only
independent predicate can be used in index scan.
http://use-the-index-luke.com/sql/join/hash-join-partial-objects
Because of this failure the planner considers the following:
Nested Looping over 380,000 records is going to suck so it tries some
advanced "merge/join" techniques to try and speed things up. In any such
alternative the entire ticket table needs to be considered since there is
no
constraint provided for that table - the only constraint in on
transmission_base and it rightly is using an index to find records matching
the where clause.Since ticket_number and latest_transmission_id are found in separate
indexes
I do not believe the planner can make use of an Index Only scan to fulfill
the join so each index lookup would require a corresponding heap lookup
which means extra work compared to just sequentially scanning the heap in
the first place. Since it is going to hit the entire thing in either case
the sequential scan is the logical choice for it to make.Others will correct any factual mistakes I may have made - I am theorizing
here and do not understand the planner sufficient well to be 100% certain
that an FK definition will solve the problem.David J.
--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Query-planner-question-tp5815659p5815661.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
--
Regards,
Soni Maula Harriz
On 22 August 2014 14:26, Soni M <diptatapa@gmail.com> wrote:
Currently we have only latest_transmission_id as FK, described here :
TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
(latest_transmission_id) REFERENCES transmission_base(transmission_id)Change the query to include only FK still result the same:
explain select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and tb.parse_date > ('2014-07-31');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=113928.06..2583606.96 rows=200338 width=8)
Hash Cond: (t.latest_transmission_id = tb.transmission_id)
-> Seq Scan on ticket t (cost=0.00..1767767.26 rows=69990826 width=8)
-> Hash (cost=108923.38..108923.38 rows=400374 width=4)
-> Index Scan using transmission_base_by_parse_date on
transmission_base tb (cost=0.00..108923.38 rows=400374 width=4)
Index Cond: (parse_date > '2014-07-31 00:00:00'::timestamp
without time zone)
(6 rows)
Do you have an index on ticket (latest_transmission_id)?
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Fri, Aug 22, 2014 at 9:10 PM, Alban Hertroys <haramrae@gmail.com> wrote:
On 22 August 2014 14:26, Soni M <diptatapa@gmail.com> wrote:
Currently we have only latest_transmission_id as FK, described here :
TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
(latest_transmission_id) REFERENCES transmission_base(transmission_id)Change the query to include only FK still result the same:
explain select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and tb.parse_date > ('2014-07-31');
QUERY PLAN----------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=113928.06..2583606.96 rows=200338 width=8)
Hash Cond: (t.latest_transmission_id = tb.transmission_id)
-> Seq Scan on ticket t (cost=0.00..1767767.26 rows=69990826width=8)
-> Hash (cost=108923.38..108923.38 rows=400374 width=4)
-> Index Scan using transmission_base_by_parse_date on
transmission_base tb (cost=0.00..108923.38 rows=400374 width=4)
Index Cond: (parse_date > '2014-07-31 00:00:00'::timestamp
without time zone)
(6 rows)Do you have an index on ticket (latest_transmission_id)?
Yes, both t.latest_transmission_id and tb.transmission_id is indexed.
Indexes:
"transmission_base_pkey" PRIMARY KEY, btree (transmission_id) CLUSTER
Indexes:
"ticket_by_latest_transmission" btree (latest_transmission_id)
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
--
Regards,
Soni Maula Harriz
Soni M wrote
On Fri, Aug 22, 2014 at 9:10 PM, Alban Hertroys <
haramrae@
> wrote:
On 22 August 2014 14:26, Soni M <
diptatapa@
> wrote:
Currently we have only latest_transmission_id as FK, described here :
TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
(latest_transmission_id) REFERENCES transmission_base(transmission_id)Change the query to include only FK still result the same:
explain select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and tb.parse_date > ('2014-07-31');
QUERY PLAN----------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=113928.06..2583606.96 rows=200338 width=8)
Hash Cond: (t.latest_transmission_id = tb.transmission_id)
-> Seq Scan on ticket t (cost=0.00..1767767.26 rows=69990826width=8)
-> Hash (cost=108923.38..108923.38 rows=400374 width=4)
-> Index Scan using transmission_base_by_parse_date on
transmission_base tb (cost=0.00..108923.38 rows=400374 width=4)
Index Cond: (parse_date > '2014-07-3100:00:00'::timestamp
without time zone)
(6 rows)Do you have an index on ticket (latest_transmission_id)?
Yes, both t.latest_transmission_id and tb.transmission_id is indexed.
Indexes:
"transmission_base_pkey" PRIMARY KEY, btree (transmission_id) CLUSTER
Indexes:
"ticket_by_latest_transmission" btree (latest_transmission_id)
Can you provide EXPLAIN ANALYZE for all three queries?
David J.
--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-planner-question-tp5815659p5815981.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Wednesday, August 20, 2014, Soni M <diptatapa@gmail.com> wrote:
Hi Everyone,
I have this query :
select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and t.ticket_number = tb.ticket_number
and tb.parse_date > ('2014-07-31');Execution plan: http://explain.depesz.com/s/YAak
Indexes on ticket :
"ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
"ticket_by_latest_transmission" btree (latest_transmission_id)
"ticket_by_ticket_number" btree (ticket_number)
This query only returns some portions of rows from ticket table.
The question is, Why does postgres need to get all the rows from ticket
table in order to complete this query?
Can't postgres use indexes to get only needed rows on ticket table?
It can, but having separate indexes on latest_transmission_id and
ticket_number is not going to work.
You need a joint index on both columns.
Cheers,
Jeff
Show quoted text
On 23 Aug 2014, at 4:34, Soni M <diptatapa@gmail.com> wrote:
On Fri, Aug 22, 2014 at 9:10 PM, Alban Hertroys <haramrae@gmail.com> wrote:
On 22 August 2014 14:26, Soni M <diptatapa@gmail.com> wrote:Currently we have only latest_transmission_id as FK, described here :
TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
(latest_transmission_id) REFERENCES transmission_base(transmission_id)Change the query to include only FK still result the same:
explain select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and tb.parse_date > ('2014-07-31');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=113928.06..2583606.96 rows=200338 width=8)
Hash Cond: (t.latest_transmission_id = tb.transmission_id)
-> Seq Scan on ticket t (cost=0.00..1767767.26 rows=69990826 width=8)
-> Hash (cost=108923.38..108923.38 rows=400374 width=4)
-> Index Scan using transmission_base_by_parse_date on
transmission_base tb (cost=0.00..108923.38 rows=400374 width=4)
Index Cond: (parse_date > '2014-07-31 00:00:00'::timestamp
without time zone)
(6 rows)Do you have an index on ticket (latest_transmission_id)?
Yes, both t.latest_transmission_id and tb.transmission_id is indexed.
Indexes:
"transmission_base_pkey" PRIMARY KEY, btree (transmission_id) CLUSTER
Indexes:
"ticket_by_latest_transmission" btree (latest_transmission_id)
Okay, so we got those indexes. So much for the low-hanging fruit.
From the above plan we learn that the database estimates[1]You did not provide explain analyse output, so we only have estimates to work with. -- If you can't see the forest for the trees, cut the trees and you'll find there is no forest. that 400k rows from transmission match your condition (parse_date > '2014-07-31’). The ticket table has a foreign key to that table, which suggests a 1:n relationship. It also has around 70M rows, or at least the database seems to think that about that amount will match those 400k transmissions.
That means that if on average 175 (=70M/400k) ticket ID’s match a transmission ID, the database would be needing all those 70M rows anyway - and even if it only needs every 175th row, a sequential scan is not a particularly inefficient way to go about this.
The alternative is a whole lot of index lookups, probably not in the same order as either the index or the rows on disk, meaning quite a bit of random disk I/O.
I’m suspecting that the cost estimates for this query with seq-scans disabled aren’t very different, provided doing so comes up with a comparable plan.
Things you might want to verify/try:
* Are those estimated numbers of rows accurate? If not, is autovacuum (or scheduled vacuum) keeping up with the amount of data churn on these tables? Do you collect a sufficiently large sample for the statistics?
* How much bloat is in these tables/indexes?
* Did you change planner settings (such as disabling bitmap scans; I kind of expected one here) or did you change cost estimate parameters?
* Does it help to put an index on transmission (parse_date, transmission_id)?
* If none of that helps, we’re going to need the output of explain analyze - that will probably take long to create, so you might as well start with that and do the other stuff at the side.
What kind of hardware are these disks on? Is it possible that disk I/O on this particular machine is relatively slow (relative to the seq/random cost factor for disk access as specified in your postgresql.conf)?
Cheers,
Alban Hertroys
[1]: You did not provide explain analyse output, so we only have estimates to work with. -- If you can't see the forest for the trees, cut the trees and you'll find there is no forest.
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
here's the explain analyze result : http://explain.depesz.com/s/Mvv and
http://explain.depesz.com/s/xxF9
it seems that i need to dig more on query planner parameter.
BTW, thanks all for the helps.
On Sat, Aug 23, 2014 at 4:33 PM, Alban Hertroys <haramrae@gmail.com> wrote:
On 23 Aug 2014, at 4:34, Soni M <diptatapa@gmail.com> wrote:
On Fri, Aug 22, 2014 at 9:10 PM, Alban Hertroys <haramrae@gmail.com>
wrote:
On 22 August 2014 14:26, Soni M <diptatapa@gmail.com> wrote:
Currently we have only latest_transmission_id as FK, described here :
TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
(latest_transmission_id) REFERENCES transmission_base(transmission_id)Change the query to include only FK still result the same:
explain select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and tb.parse_date > ('2014-07-31');
QUERY PLAN----------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=113928.06..2583606.96 rows=200338 width=8)
Hash Cond: (t.latest_transmission_id = tb.transmission_id)
-> Seq Scan on ticket t (cost=0.00..1767767.26 rows=69990826width=8)
-> Hash (cost=108923.38..108923.38 rows=400374 width=4)
-> Index Scan using transmission_base_by_parse_date on
transmission_base tb (cost=0.00..108923.38 rows=400374 width=4)
Index Cond: (parse_date > '2014-07-3100:00:00'::timestamp
without time zone)
(6 rows)Do you have an index on ticket (latest_transmission_id)?
Yes, both t.latest_transmission_id and tb.transmission_id is indexed.
Indexes:
"transmission_base_pkey" PRIMARY KEY, btree (transmission_id) CLUSTER
Indexes:
"ticket_by_latest_transmission" btree (latest_transmission_id)Okay, so we got those indexes. So much for the low-hanging fruit.
From the above plan we learn that the database estimates[1] that 400k rows
from transmission match your condition (parse_date > '2014-07-31’). The
ticket table has a foreign key to that table, which suggests a 1:n
relationship. It also has around 70M rows, or at least the database seems
to think that about that amount will match those 400k transmissions.That means that if on average 175 (=70M/400k) ticket ID’s match a
transmission ID, the database would be needing all those 70M rows anyway -
and even if it only needs every 175th row, a sequential scan is not a
particularly inefficient way to go about this.
The alternative is a whole lot of index lookups, probably not in the same
order as either the index or the rows on disk, meaning quite a bit of
random disk I/O.I’m suspecting that the cost estimates for this query with seq-scans
disabled aren’t very different, provided doing so comes up with a
comparable plan.Things you might want to verify/try:
* Are those estimated numbers of rows accurate? If not, is autovacuum (or
scheduled vacuum) keeping up with the amount of data churn on these tables?
Do you collect a sufficiently large sample for the statistics?
* How much bloat is in these tables/indexes?
* Did you change planner settings (such as disabling bitmap scans; I kind
of expected one here) or did you change cost estimate parameters?
* Does it help to put an index on transmission (parse_date,
transmission_id)?
* If none of that helps, we’re going to need the output of explain analyze
- that will probably take long to create, so you might as well start with
that and do the other stuff at the side.What kind of hardware are these disks on? Is it possible that disk I/O on
this particular machine is relatively slow (relative to the seq/random cost
factor for disk access as specified in your postgresql.conf)?Cheers,
Alban Hertroys
[1] You did not provide explain analyse output, so we only have estimates
to work with.
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
--
Regards,
Soni Maula Harriz
On Wed, Aug 20, 2014 at 6:16 PM, Soni M <diptatapa@gmail.com> wrote:
Hi Everyone,
I have this query :
select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and t.ticket_number = tb.ticket_number
and tb.parse_date > ('2014-07-31');Execution plan: http://explain.depesz.com/s/YAak
Indexes on ticket :
"ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
"ticket_by_latest_transmission" btree (latest_transmission_id)
"ticket_by_ticket_number" btree (ticket_number)This query only returns some portions of rows from ticket table.
The question is, Why does postgres need to get all the rows from ticket
table in order to complete this query?
Can't postgres use indexes to get only needed rows on ticket table?I try set seqscan to off, but still index scan try to get all rows on
ticket table.
Here's the execution plan : http://explain.depesz.com/s/abH2
If you want to force a nested loop, you probably need to disable the
mergejoin as well, and maybe the hashjoin. Forcing the planner to do
things the way you want can be difficult.
Cheers,
Jeff
On Mon, 25 Aug 2014 09:09:07 -0700
Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Aug 20, 2014 at 6:16 PM, Soni M <diptatapa@gmail.com> wrote:
Hi Everyone,
I have this query :
select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and t.ticket_number = tb.ticket_number
and tb.parse_date > ('2014-07-31');Execution plan: http://explain.depesz.com/s/YAak
Indexes on ticket :
"ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
"ticket_by_latest_transmission" btree (latest_transmission_id)
"ticket_by_ticket_number" btree (ticket_number)This query only returns some portions of rows from ticket table.
The question is, Why does postgres need to get all the rows from ticket
table in order to complete this query?
Can't postgres use indexes to get only needed rows on ticket table?I try set seqscan to off, but still index scan try to get all rows on
ticket table.
Here's the execution plan : http://explain.depesz.com/s/abH2
That's probably not the best approach, it's likely that something is feeding
the planner wrong information. An EXPLAIN ANALYZE might reveal if that's the
case.
Some other things to check: are these two tables being analyzed frequently
enough that their statistics are up to date? (EXPLAIN ANALYZE will generally
show if that's a problem too). It would seem that the planner thinks that
the distribution of tb.ticket_number is large enough that it will probably
have to fetch most of the rows from ticket anyway, which is a logical reason
for it to skip the index and just do a seq scan. Can you confirm/deny whether
that's the case? If not, and you're analyzing the tables often enough, you
may need to raise your statistics target on those tables.
--
Bill Moran
I need your help to succeed:
http://gamesbybill.com
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general