performance of IN (subquery)

Started by Kevin Murphy · over 21 years ago · 24 messages · general
#1Kevin Murphy
murphy@genome.chop.edu

I'm using PG 7.4.3 on Mac OS X.

I am disappointed with the performance of queries like 'select foo from
bar where baz in (subquery)', or updates like 'update bar set foo = 2
where baz in (subquery)'. PG always seems to want to do a sequential
scan of the bar table. I wish there were a way of telling PG, "use the
index on baz in your plan, because I know that the subquery will return
very few results". Where it really matters, I have been constructing
dynamic queries by looping over the values for baz and building a
separate query for each one and combining with a UNION (or just
directly updating, in the update case). Depending on the size of the
bar table, I can get speedups of hundreds or even more than a thousand
times, but it is a big pain to have to do this.
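The hand-built version looks roughly like this sketch (using the placeholder names bar/baz/foo from above, with two values standing in for the subquery's results):

```sql
-- Sketch of the workaround: one indexable query per subquery result,
-- combined with UNION, instead of a single IN (subquery).
SELECT foo FROM bar WHERE baz = 41209
UNION
SELECT foo FROM bar WHERE baz = 25047;
-- Each branch can use the index on baz; the literals 41209 and 25047
-- stand in for whatever the subquery would have returned.
```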

Any tips?

Thanks,
Kevin Murphy

Illustrated:

The query I want to do is very slow:

select bundle_id from build.elements
where elementid in (
SELECT superlocs_2.element_id
FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
WHERE bundle_superlocs_2.protobundle_id = 1);
 bundle_id
-----------
      7644
      7644
(2 rows)
Time: 518.242 ms

The subquery is fast:

SELECT superlocs_2.element_id
FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
WHERE bundle_superlocs_2.protobundle_id = 1;
 element_id
------------
      41209
      25047
(2 rows)
Time: 3.268 ms

And using indexes on the main table is fast:

select bundle_id from build.elements
where elementid in (41209, 25047);
 bundle_id
-----------
      7644
      7644
(2 rows)
Time: 2.468 ms

The plan for the slow query:

egenome_test=# explain analyze select bundle_id from build.elements
where elementid in (
SELECT superlocs_2.element_id
FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
WHERE bundle_superlocs_2.protobundle_id = 1);
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=70.33..72.86 rows=25 width=4) (actual time=583.051..583.059 rows=2 loops=1)
   Hash Cond: ("outer".element_id = "inner".elementid)
   ->  HashAggregate  (cost=47.83..47.83 rows=25 width=4) (actual time=0.656..0.658 rows=2 loops=1)
         ->  Hash Join  (cost=22.51..47.76 rows=25 width=4) (actual time=0.615..0.625 rows=2 loops=1)
               Hash Cond: ("outer".superloc_id = "inner".superloc_id)
               ->  Seq Scan on superlocs_2  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.004..0.012 rows=9 loops=1)
               ->  Hash  (cost=22.50..22.50 rows=5 width=4) (actual time=0.076..0.076 rows=0 loops=1)
                     ->  Seq Scan on bundle_superlocs_2  (cost=0.00..22.50 rows=5 width=4) (actual time=0.024..0.033 rows=2 loops=1)
                           Filter: (protobundle_id = 1)
   ->  Hash  (cost=20.00..20.00 rows=1000 width=8) (actual time=581.802..581.802 rows=0 loops=1)
         ->  Seq Scan on elements  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.172..405.243 rows=185535 loops=1)
 Total runtime: 593.843 ms
(12 rows)

#2The Hermit Hacker
scrappy@hub.org
In reply to: Kevin Murphy (#1)
Re: performance of IN (subquery)

On Thu, 26 Aug 2004, Kevin Murphy wrote:

I'm using PG 7.4.3 on Mac OS X.

[snip]

The query I want to do is very slow:

select bundle_id from build.elements
where elementid in (
SELECT superlocs_2.element_id
FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
WHERE bundle_superlocs_2.protobundle_id = 1);
[snip]
Time: 518.242 ms

What field type is protobundle_id? If you typecast the '1' to be the
same, does the index get used?
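For example, if protobundle_id happened to be smallint (hypothetical; the original post doesn't say), the cast would be:

```sql
-- Hypothetical: if protobundle_id were smallint, an untyped integer
-- literal could defeat the index in 7.4-era planners; cast it to match:
SELECT superlocs_2.element_id
FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
WHERE bundle_superlocs_2.protobundle_id = 1::smallint;
```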

Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664

#3Paul Tillotson
pntil@shentel.net
In reply to: Kevin Murphy (#1)
Re: performance of IN (subquery)

Kevin Murphy wrote:

Hash Join  (cost=70.33..72.86 rows=25 width=4) (actual time=583.051..583.059 rows=2 loops=1)
[snip]
  ->  Hash  (cost=20.00..20.00 rows=1000 width=8) (actual time=581.802..581.802 rows=0 loops=1)
        ->  Seq Scan on elements  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.172..405.243 rows=185535 loops=1)

The planner thinks that the sequential scan on elements will return 1000
rows, but it actually returned 185000. Did you ANALYZE this table recently?
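The estimates the planner is working from are visible in pg_class; a quick sanity check might look like this (a sketch, not from the original thread):

```sql
-- Compare the planner's stored estimate with the real count.
-- reltuples and relpages are only refreshed by VACUUM or ANALYZE.
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname = 'elements';

SELECT count(*) FROM build.elements;  -- actual row count

ANALYZE build.elements;               -- refresh the statistics
```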

Afterthought: It would be nice if the database was smart enough to
analyze a table of its own accord when a sequential scan returns more
than, say, 20 times what it was supposed to.

Paul


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

#4Arthur Ward
award-postgresql@dominionsciences.com
In reply to: Paul Tillotson (#3)
Re: performance of IN (subquery)

Afterthought: It would be nice if the database was smart enough to
analyze a table of its own accord when a sequential scan returns more
than, say, 20 times what it was supposed to.

I've wondered on several occasions if there is any good reason for PG not
to automatically perform an analyze concurrently with a seq scan as it's
happening. That way, no extra disk IO is needed and the stats could stay
up-to-date for almost free.

Any hackers around who can say why this might be a bad idea, or is it one
of those things that just needs a volunteer? (I'm not; at least not now.)

#5Greg Stark
gsstark@mit.edu
In reply to: Arthur Ward (#4)
Re: performance of IN (subquery)

"Arthur Ward" <award-postgresql@dominionsciences.com> writes:

Any hackers around who can say why this might be a bad idea, or is it one
of those things that just needs a volunteer? (I'm not; at least not now.)

a) that would make plans change spontaneously. I hate being paged in the
middle of the night because some query is suddenly being slow when it had been
performing fine before.

b) Not all sequential scans will actually complete the scan. There could be a
limit imposed, or the sequential scan could be inside an EXISTS. In that case
the scan could be aborted at any point.

What I do think would be easy to do would be to keep statistics on the expense
of various components of the cost estimates. cpu_*_cost, random_page_cost, and
effective_cache_size ought to be values that can be solved for empirically
from the timing results.

But that still doesn't have to be done on every query. There's a trade-off
between work done on every query to plan queries and the benefit. Gathering
statistics and storing them on every sequential scan is way too much work
slowing down every query for minimal gain.

--
greg

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Paul Tillotson (#3)
Re: performance of IN (subquery)

Paul Tillotson <pntil@shentel.net> writes:

The planner thinks that the sequential scan on elements will return 1000
rows, but it actually returned 185000. Did you ANALYZE this table recently?

Or either of the other ones? All those scan costs look like defaults
:-(

Afterthought: It would be nice if the database was smart enough to
analyze a table of its own accord when a sequential scan returns more
than, say, 20 times what it was supposed to.

I've thought about this before. One simple trick would be to get rid
of the current pg_class reltuples/relpages fields in favor of a
tuples-per-page estimate, which could be multiplied by
RelationGetNumberOfBlocks() during planning. In the absence of any
ANALYZE data the tuples-per-page estimate might be pretty bogus, but
it couldn't be off by more than an order of magnitude or so either way.
And in any case we'd have a guaranteed up-to-date number of blocks.
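Roughly, the proposal replaces the stored reltuples with a density scaled by the live block count. On releases newer than the 7.4 under discussion the arithmetic can be sketched against pg_class (pg_relation_size postdates 7.4, so this is only an illustration):

```sql
-- Sketch of the proposed estimate (not how 7.4 actually works):
--   estimated_rows = (reltuples / relpages) * current_number_of_blocks
SELECT relname,
       reltuples / NULLIF(relpages, 0)        AS tuples_per_page,
       pg_relation_size(oid) / 8192           AS current_blocks,
       (reltuples / NULLIF(relpages, 0))
         * (pg_relation_size(oid) / 8192)     AS estimated_rows
FROM pg_class
WHERE relname = 'elements';
```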

The objections that could be raised to this are (AFAICS) two:

1. Adding at least an lseek() kernel call per table, and per index, to
every planning operation. I'm not sure this would be significant,
but I'm not sure it wouldn't be, either.

2. Instability of plans. Right now, the planner will not change plans
underneath you --- you have to issue an explicit VACUUM or ANALYZE
to change the terms of discussion. That would stop being true if
physical file size were always taken into account. Maybe this is a
problem, or maybe it isn't ... as someone who likes to be able to
debug planner behavior without actually creating umpteen-gig test
tables, my world view may be a bit skewed ...

It's certainly doable if we decide the pluses outweigh the minuses.
Thoughts?

regards, tom lane

#7Greg Stark
gsstark@mit.edu
In reply to: Tom Lane (#6)
Re: performance of IN (subquery)

Tom Lane <tgl@sss.pgh.pa.us> writes:

I've thought about this before. One simple trick would be to get rid of the
current pg_class reltuples/relpages fields in favor of a tuples-per-page
estimate, which could be multiplied by RelationGetNumberOfBlocks() during
planning.

This would do something interesting to one of the problem cases I have now. I
have trouble testing a particular batch job that generates a large amount of
precalculated denormalized data.

That's because the data is empty when the job starts, so the plpgsql function
that handles the job caches plans based on an empty table. But as the job
proceeds the data grows and I'm afraid the cached plan may start performing
poorly.

In order to test it I need to run it once, analyze, then reload the function,
truncate the data and run it another time. And hope I generated good
representative test data.

I'm thinking of changing to a non-plpgsql implementation. But that's only half
the issue. I'm not about to run analyze in the middle of the data generation
(which wouldn't work anyways since it's in a transaction). So I can't really
get good statistics for this job, not until we're actually in a steady state
in production.

Sometimes I wonder whether I wouldn't rather have a more predictable system
with less focus on statistics. The resulting plans would be more predictable,
and predictability is a good thing in production systems...

In the absence of any ANALYZE data the tuples-per-page estimate might be
pretty bogus, but it couldn't be off by more than an order of magnitude or
so either way.

I don't see why it couldn't. If you have a table badly in need of vacuuming
(or had one at the time of the last analyze) it could be off by way more than
an order of magnitude.

For that matter, a table that had undergone many deletes and then been
vacuumed would not change length for a long time afterward even as many new
inserts are performed. Until the table is analyzed the estimated table size
could be off by an arbitrary factor.

The objections that could be raised to this are (AFAICS) two:

1. Adding at least an lseek() kernel call per table, and per index, to
every planning operation. I'm not sure this would be significant,
but I'm not sure it wouldn't be, either.

That seems like something that could be addressed with enough time. A single
value per table, couldn't it be stored in shared memory and only updated
whenever the heap is extended or truncated? Even if you have thousands of
tables that would only be a few kilobytes of shared memory.

2. Instability of plans. Right now, the planner will not change plans
underneath you --- you have to issue an explicit VACUUM or ANALYZE
to change the terms of discussion. That would stop being true if
physical file size were always taken into account. Maybe this is a
problem, or maybe it isn't ... as someone who likes to be able to
debug planner behavior without actually creating umpteen-gig test
tables, my world view may be a bit skewed ...

This is what I'm afraid of. As an OLTP application programmer -- web sites, I
admit -- I care a lot more about plan stability than about finding optimal plans.

A well-written OLTP application will only have a fixed number of queries that
are executed repeatedly with different parameters. I don't care how long the
queries take as long as they're always "fast enough".

Every time a new plan is tried there's a chance that it will be wrong. It only
takes one wrong plan out of the hundreds of queries to bring down the entire
application.

Ideally I would want a guarantee that every query would *always* result in the
same plan. Once I've tested them and approved the plans I want to know that
only those approved plans will ever run, and I want to be present and be able
to verify new plans before they go into production.

I doubt I'm going to convince anyone today, but I think there will be a
gradual change in mindset as the new binary protocol becomes more popular. And
postgres takes over some of mysql's web mindshare.

--
greg

#8Gaetano Mendola
mendola@bigfoot.com
In reply to: Tom Lane (#6)
Re: performance of IN (subquery)

Tom Lane wrote:

2. Instability of plans. Right now, the planner will not change plans
underneath you --- you have to issue an explicit VACUUM or ANALYZE
to change the terms of discussion. That would stop being true if
physical file size were always taken into account. Maybe this is a
problem, or maybe it isn't ... as someone who likes to be able to
debug planner behavior without actually creating umpteen-gig test
tables, my world view may be a bit skewed ...

Did you forget the autovacuum daemon? I haven't seen anyone bitten, or
paged during the night, because of the autovacuum daemon's work.

Regards
Gaetano Mendola

#9Jon Lapham
lapham@jandr.org
In reply to: Tom Lane (#6)
Re: performance of IN (subquery)

Tom Lane wrote:

I've thought about this before. One simple trick would be to get rid
of the current pg_class reltuples/relpages fields in favor of a
tuples-per-page estimate, which could be multiplied by
RelationGetNumberOfBlocks() during planning. In the absence of any
ANALYZE data the tuples-per-page estimate might be pretty bogus, but
it couldn't be off by more than an order of magnitude or so either way.
And in any case we'd have a guaranteed up-to-date number of blocks.

The objections that could be raised to this are (AFAICS) two:
[snip]
2. Instability of plans. Right now, the planner will not change plans
underneath you --- you have to issue an explicit VACUUM or ANALYZE
to change the terms of discussion. That would stop being true if
physical file size were always taken into account. Maybe this is a
problem, or maybe it isn't ... as someone who likes to be able to
debug planner behavior without actually creating umpteen-gig test
tables, my world view may be a bit skewed ...

It's certainly doable if we decide the pluses outweigh the minuses.
Thoughts?

My first reaction is to wonder if this would give performance exactly
equal to running a true ANALYZE in every situation? If not, then you
would end up with an automated pseudo-ANALYZE (performance-wise).

In my opinion, it is almost a feature that non-ANALYZE-d tables give
such horrendous performance, it kicks you in the butt to do some
thinking about when to correctly deal with ANALYZEing.

So, in short, I think it is a huge win if we could have automatic
ANALYZE with true ANALYZE performance, but a huge loss if the automatic
ANALYZE performance is not exactly as good as a true ANALYZE.

--
-**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*---
Jon Lapham <lapham@jandr.org> Rio de Janeiro, Brasil
Personal: http://www.jandr.org/
***-*--*----*-------*------------*--------------------*---------------

#10Kevin Murphy
murphy@genome.chop.edu
In reply to: Paul Tillotson (#3)
Re: performance of IN (subquery)

Thanks all for the reminders about analyzing, and I apologize for wasting
everyone's time. The main table did indeed need to be analyzed (I had
truncated it and repopulated it with "insert ... select" but forgotten to
analyze). The other tables are very small temporary tables, and I assumed,
whether correctly or not, that analyzing would not be helpful for them.

All this is happening in the context of an algorithm in a PL/PGSQL function.
The three temporary tables are reused thousands of times. I wasn't sure if
it would be better to truncate them between uses to keep them small or just
allow them to grow. Unfortunately for the Tree of Knowledge, performance is
now more than adequate, so I may not do this experiment.

Thanks,
Kevin Murphy

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jon Lapham (#9)
Re: performance of IN (subquery)

Jon Lapham <lapham@jandr.org> writes:

Tom Lane wrote:

I've thought about this before. One simple trick would be to get rid
of the current pg_class reltuples/relpages fields in favor of a
tuples-per-page estimate, which could be multiplied by
RelationGetNumberOfBlocks() during planning.

My first reaction is to wonder if this would give performance exactly
equal to running a true ANALYZE in every situation?

No, this is quite orthogonal to ANALYZE (and several orders of magnitude
cheaper...) It would fix a lot of the simpler cases, though, since the
first-order effects like table size would be right.

regards, tom lane

#12Paul Tillotson
pntil@shentel.net
In reply to: Kevin Murphy (#1)
Re: performance of IN (subquery)

2. Instability of plans. Right now, the planner will not change plans
underneath you --- you have to issue an explicit VACUUM or ANALYZE
to change the terms of discussion. That would stop being true if
physical file size were always taken into account. Maybe this is a
problem, or maybe it isn't ... as someone who likes to be able to
debug planner behavior without actually creating umpteen-gig test
tables, my world view may be a bit skewed ...

It's certainly doable if we decide the pluses outweigh the minuses.
Thoughts?

Even though I just proposed it, now I think that the idea of having
postgres automatically gather statistics only when an estimate is
grossly wrong is not a good one. It would just obscure the need to run
analyze, and many people would then always be using moderately bad
statistics without realizing it.

(I think I used postgres in a production environment for about six
months without even knowing about the ANALYZE command--with small data
sets it's not obvious--and it was still blazingly fast compared to foxpro.)

If pgautovacuum does ANALYZE and we can get most people to use
pgautovacuum, I think this problem will go away.

Another possibility is simply to make postgres generate a warning, saying:

WARNING: statistics for table xxxxxx are off by a factor of y. You
should ANALYZE this table.

Paul

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#7)
Re: performance of IN (subquery)

Greg Stark <gsstark@mit.edu> writes:

I'm not about to run analyze in the middle of the data generation
(which wouldn't work anyways since it's in a transaction).

Since 7.3 or 7.4, you *can* run ANALYZE in the middle of a transaction.
The cached-plan business is a problem, I agree, but I think it's
orthogonal to this particular discussion (and you can always use EXECUTE
if you have to).

In the absence of any ANALYZE data the tuples-per-page estimate might be
pretty bogus, but it couldn't be off by more than an order of magnitude or
so either way.

I don't see why it couldn't. If you have a table badly in need of vacuuming
(or had one at the time of the last analyze) it could be off by way more than
an order of magnitude.

Well, I was actually thinking of the physical tuples-per-page stat
(perhaps better expressed as an average tuple size), but you are right
that the fraction of dead tuples is also something to think about.
We don't model that explicitly ATM but maybe we should. The original
VACUUM-based stats code couldn't really do much with it, since VACUUM
would leave no dead tuples behind in the first place; but separate
ANALYZE could definitely make an estimate of the fraction of dead tuples.

Ideally I would want a guarantee that every query would *always*
result in the same plan. Once I've tested them and approved the plans
I want to know that only those approved plans will ever run, and I
want to be present and be able to verify new plans before they go into
production.

I doubt I'm going to convince anyone today,

Nope, you aren't. The above seems to me to be a recipe for degradation
of performance over time, precisely because the plans wouldn't change in
the face of changes in the situation. I've resisted adding "planner
hints" to the language for this reason, and I'm certainly not eager to
offer any hard guarantees about plans not changing.

regards, tom lane

#14Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#13)
Re: performance of IN (subquery)

Is there anything for the TODO here?

---------------------------------------------------------------------------

Tom Lane wrote:

[snip]

Nope, you aren't. The above seems to me to be a recipe for degradation
of performance over time, precisely because the plans wouldn't change in
the face of changes in the situation. I've resisted adding "planner
hints" to the language for this reason, and I'm certainly not eager to
offer any hard guarantees about plans not changing.

regards, tom lane

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#15Joseph Shraibman
jks@selectacast.net
In reply to: Tom Lane (#13)
Re: performance of IN (subquery)

Tom Lane wrote:

Greg Stark <gsstark@mit.edu> writes:

I'm not about to run analyze in the middle of the data generation
(which wouldn't work anyways since it's in a transaction).

Since 7.3 or 7.4, you *can* run ANALYZE in the middle of a transaction.
The cached-plan business is a problem, I agree, but I think it's
orthogonal to this particular discussion (and you can always use EXECUTE
if you have to).

How does EXECUTE solve the cached-plan business?

#16Markus Bertheau
mbertheau@gmail.com
In reply to: Joseph Shraibman (#15)
Re: performance of IN (subquery)

On Fri, 27 Aug 2004 11:09:26 -0400, Joseph Shraibman
<jks@selectacast.net> wrote:

How does EXECUTE solve the cached-plan business?

It re-plans the query at every run.

--
Markus Bertheau <mbertheau@gmail.com>

#17Duane Lee - EGOVX
DLee@mail.maricopa.gov
In reply to: Markus Bertheau (#16)
Re: performance of IN (subquery)

Have you thought about using existence checking: WHERE EXISTS (SELECT '1'
FROM FOO2 WHERE BAZ = BAZ2)

If an index exists on BAZ2 you might get away with a quick index-only
check.

Duane
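Applied to the query from the original post, the EXISTS form would look something like this (a sketch; the correlation column names are taken from the earlier plan output):

```sql
-- Correlated EXISTS rewrite of the original IN (subquery) query.
SELECT e.bundle_id
FROM build.elements e
WHERE EXISTS (
    SELECT 1
    FROM superlocs_2 s
    NATURAL JOIN bundle_superlocs_2 bs
    WHERE bs.protobundle_id = 1
      AND s.element_id = e.elementid
);
```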

-----Original Message-----
From: Kevin Murphy [mailto:murphy@genome.chop.edu]
Sent: Thursday, August 26, 2004 3:24 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] performance of IN (subquery)

[snip -- original message quoted in full]

#18Joseph Shraibman
jks@selectacast.net
In reply to: Kevin Murphy (#1)
Re: performance of IN (subquery)

According to the docs it specifically doesn't.

http://www.postgresql.org/docs/7.4/static/sql-prepare.html

When the PREPARE statement is executed, the specified statement is
parsed, rewritten, and planned. When an EXECUTE command is subsequently
issued, the prepared statement need only be executed. Thus, the parsing,
rewriting, and planning stages are only performed once, instead of every
time the statement is executed.

Markus Bertheau wrote:


On Fri, 27 Aug 2004 11:09:26 -0400, Joseph Shraibman
<jks@selectacast.net> wrote:

How does EXECUTE solve the cached-plan business?

It re-plans the query at every run.

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joseph Shraibman (#18)
Re: performance of IN (subquery)

Joseph Shraibman <jks@selectacast.net> writes:

Markus Bertheau wrote:

It re-plans the query at every run.

According to the docs is specifically doesn't.

You're confusing SQL EXECUTE statement with the quite different plpgsql
EXECUTE statement ...

regards, tom lane
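In plpgsql, routing the query through EXECUTE forces a fresh plan each call. A minimal sketch in the 7.4-era quoted-body style (function and column names are hypothetical):

```sql
-- plpgsql EXECUTE: the query string is re-planned on every call, so it
-- sees current statistics rather than a plan cached at first execution.
-- (EXECUTE ... INTO did not exist yet; FOR ... IN EXECUTE is used instead.)
CREATE OR REPLACE FUNCTION count_elements() RETURNS bigint AS '
DECLARE
    rec record;
BEGIN
    FOR rec IN EXECUTE ''SELECT count(*) AS n FROM build.elements'' LOOP
        RETURN rec.n;
    END LOOP;
    RETURN NULL;
END;
' LANGUAGE plpgsql;
```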

#20Matthew T. O'Connor
matthew@zeut.net
In reply to: Paul Tillotson (#12)
Re: performance of IN (subquery)

Paul Tillotson wrote:

If pgautovacuum does ANALYZE and we can get most people to use
pgautovacuum, I think this problem will go away.

It does. We won't get most people to use it for 8.0 since it is still a
contrib project, but hopefully it will be built-in for 8.1.

#21Bruce Momjian
bruce@momjian.us
In reply to: Kevin Murphy (#1)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#21)
#23Gaetano Mendola
mendola@bigfoot.com
In reply to: Greg Stark (#7)
#24Bruce Momjian
bruce@momjian.us
In reply to: Kevin Murphy (#1)