Help with sql
Hi Guys,
This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list, please let me know.
It is also, perhaps, a really silly question.
This query (without the 'explain' keyword) , when executed takes forever and a day:
condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps );
QUERY PLAN
----------------------------------------------------------------------------------------------
Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..2517.78 rows=95852 width=4)
-> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
(5 rows)
This query returns within a second:
condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps );
QUERY PLAN
----------------------------------------------------------------------------------------------
Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4)
-> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
(5 rows)
The difference is the "distinct" keyword in the inner select.
What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using "sets" and an item (I thought) can not be in a set more than once.
Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operator is "in" or "not in" ?
And if I can throw in another question on top: is there a different method other than "not in" that would work better?
Thank you guys for the help and a really awesome database.
pedz
(this is *suppose* to be sent as plain text... I hope my mailer does what it is told)
On 07/06/2012 03:34 PM, Perry Smith wrote:
Hi Guys,
This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list, please let me know.
It is also, perhaps, a really silly question.
This query (without the 'explain' keyword) , when executed takes forever and a day:
condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps );
QUERY PLAN
----------------------------------------------------------------------------------------------
Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..2517.78 rows=95852 width=4)
-> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
(5 rows)This query returns within a second:
condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps );
QUERY PLAN
----------------------------------------------------------------------------------------------
Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4)
-> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
(5 rows)The difference is the "distinct" keyword in the inner select.
What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using "sets" and an item (I thought) can not be in a set more than once.
Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operator is "in" or "not in" ?
And if I can throw in another question on top: is there a different method other than "not in" that would work better?
Thank you guys for the help and a really awesome database.
pedz
(this is *suppose* to be sent as plain text... I hope my mailer does what it is told)
Well they are distinct records, they just may have the same values. And
I'm not trying to be flippant. We don't see the structure of those
table: are all the id fields involved primary keys or with unique index
coverage? Does "not exists ( select fileset.id = fileset_id from
service_pack_fileset_map) change the behaviour?
On 07/06/2012 02:34 PM, Perry Smith wrote:
Hi Guys,
This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list, please let me know.
It is also, perhaps, a really silly question.
This query (without the 'explain' keyword) , when executed takes forever and a day:
condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps );
QUERY PLAN
----------------------------------------------------------------------------------------------
Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..2517.78 rows=95852 width=4)
-> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
(5 rows)This query returns within a second:
condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps );
QUERY PLAN
----------------------------------------------------------------------------------------------
Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4)
-> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
(5 rows)The difference is the "distinct" keyword in the inner select.
What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using "sets" and an item (I thought) can not be in a set more than once.
Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operator is "in" or "not in" ?
And if I can throw in another question on top: is there a different method other than "not in" that would work better?
Actually it is *very* PostgreSQL specific. In fact, it may even be
PostgreSQL *version* specific as you are delving into how the planner
decides how to handle a query.
It appears that the planner is assuming, based on collected stats and
available indexes, that there will be roughly 1/6 the records returned
by the "distinct" query and thus chose a different method to join the
records. One useful piece of information would be the indexes on the two
tables.
As to other methods, you can use:
... where not exists (select 1 from service_pack_fileset_maps where
fileset_id = filesets.id)...
(Note: as alluded to above, ...not in... works better in some releases
and ...not exists... better in others due to improvements over time.)
Still another method:
select id from filesets except select fileset_id from
service_pack_fileset_maps;
Cheers,
Steve
On Jul 6, 2012, at 7:56 PM, Steve Crawford wrote:
On 07/06/2012 02:34 PM, Perry Smith wrote:
Hi Guys,
This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list, please let me know.
It is also, perhaps, a really silly question.
This query (without the 'explain' keyword) , when executed takes forever and a day:
condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps );
QUERY PLAN
----------------------------------------------------------------------------------------------
Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..2517.78 rows=95852 width=4)
-> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
(5 rows)This query returns within a second:
condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps );
QUERY PLAN
----------------------------------------------------------------------------------------------
Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4)
-> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
(5 rows)The difference is the "distinct" keyword in the inner select.
What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using "sets" and an item (I thought) can not be in a set more than once.
Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operator is "in" or "not in" ?
And if I can throw in another question on top: is there a different method other than "not in" that would work better?
Actually it is *very* PostgreSQL specific. In fact, it may even be PostgreSQL *version* specific as you are delving into how the planner decides how to handle a query.
It appears that the planner is assuming, based on collected stats and available indexes, that there will be roughly 1/6 the records returned by the "distinct" query and thus chose a different method to join the records. One useful piece of information would be the indexes on the two tables.
As to other methods, you can use:
... where not exists (select 1 from service_pack_fileset_maps where fileset_id = filesets.id)...
(Note: as alluded to above, ...not in... works better in some releases and ...not exists... better in others due to improvements over time.)Still another method:
select id from filesets except select fileset_id from service_pack_fileset_maps;
Thanks guys.
Small side note: I thought I saw "set difference" in the documentation but I couldn't find it. It appears "EXCEPT" is set difference. Thank you for that tidbit.
The database is mostly static. I run through a very lengthy process to populate the database maybe once a month and then it is 99% read-only. By far, most of the accesses are via a view that I have that is rather long and ugly so I won't paste it in. I've tried to make this particular view as fast as possible so the indexes, etc are what I think will help that out.
The version is psql (PostgreSQL) 9.0.4
Each table has a key of "id" -- this database was created by / used by Ruby on Rails and that is how it likes to do things.
condor_development=> \d service_pack_fileset_maps
Table "public.service_pack_fileset_maps"
Column | Type | Modifiers
-----------------+-----------------------------+------------------------------------------------------------------------
id | integer | not null default nextval('service_pack_fileset_maps_id_seq'::regclass)
service_pack_id | integer | not null
fileset_id | integer | not null
created_at | timestamp without time zone |
updated_at | timestamp without time zone |
Indexes:
"service_pack_fileset_maps_pkey" PRIMARY KEY, btree (id)
"service_pack_fileset_maps_service_pack_id_key" UNIQUE, btree (service_pack_id, fileset_id)
"index_service_pack_fileset_maps_on_fileset_id" btree (fileset_id)
Foreign-key constraints:
"service_pack_fileset_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE
"service_pack_fileset_maps_service_pack_id_fkey" FOREIGN KEY (service_pack_id) REFERENCES service_packs(id) ON DELETE CASCADE DEFERRABLE
condor_development=> \d filesets
Table "public.filesets"
Column | Type | Modifiers
------------+-----------------------------+-------------------------------------------------------
id | integer | not null default nextval('filesets_id_seq'::regclass)
lpp_id | integer | not null
vrmf | character varying(255) | not null
created_at | timestamp without time zone |
updated_at | timestamp without time zone |
Indexes:
"filesets_pkey" PRIMARY KEY, btree (id)
"filesets_lpp_id_key" UNIQUE, btree (lpp_id, vrmf)
Foreign-key constraints:
"filesets_lpp_id_fkey" FOREIGN KEY (lpp_id) REFERENCES lpps(id) ON DELETE CASCADE DEFERRABLE
Referenced by:
TABLE "fileset_aix_file_maps" CONSTRAINT "fileset_aix_file_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE
TABLE "fileset_ptf_maps" CONSTRAINT "fileset_ptf_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE
TABLE "package_fileset_maps" CONSTRAINT "package_fileset_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE
TABLE "service_pack_fileset_maps" CONSTRAINT "service_pack_fileset_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE
TABLE "upd_pc_views" CONSTRAINT "upd_pc_views_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE
Thank you again for your help,
pedz
On Sat, Jul 7, 2012 at 11:48 PM, Perry Smith <pedzsan@gmail.com> wrote:
The database is mostly static. I run through a very lengthy process to populate the database maybe once a month and then it is 99% read-only.
Do you run an ANALYZE on the table after populating it? Postgres needs
up-to-date statistics for best results. I'd recommend doing an
explicit 'VACUUM ANALYZE' once your data's loaded, and then try your
queries after that - it might not do much, but it also might give a
massive improvement.
ChrisA