eliminating records not in (select id ... so SLOW?

Started by Ivan Sergio Borgonovo over 17 years ago. 11 messages. List: pgsql-general.
#1Ivan Sergio Borgonovo
mail@webthatworks.it

I'm doing something like:

delete from table1 where id not in (select id from table2).

Both id columns are indexed.

table1 contains ~1M records, table2 contains ~600K records, and id is
unique.

The box is a 2x dual-core Xeon (below 2GHz) with 4GB RAM.
Default Debian etch setup.

It has been working for over 2h now.

Is it normal?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

#2Lennin Caro
lennin.caro@yahoo.com
In reply to: Ivan Sergio Borgonovo (#1)
Re: eliminating records not in (select id ... so SLOW?
--- On Thu, 7/31/08, Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:


Did you run vacuum recently?


--
Sent via pgsql-general mailing list
(pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Ivan Sergio Borgonovo
mail@webthatworks.it
In reply to: Lennin Caro (#2)
Re: eliminating records not in (select id ... so SLOW?

On Thu, 31 Jul 2008 14:59:29 -0700 (PDT)
Lennin Caro <lennin.caro@yahoo.com> wrote:

> > The box is a 2x dual core Xeon (below 2GHz) with 4Gb ram.
> > Default debian etch setup.
>
> you recently run vacuum ?

The tables are pretty stable. I think no more than 20 records were
modified (update/insert/delete) during the whole history of the 2
tables.

autovacuum is running regularly.

The actual query running is:

begin;
create index catalog_categoryitem_ItemsID_index on
catalog_categoryitem using btree (ItemID);
delete from catalog_categoryitem
where ItemID not in (select ItemID from catalog_items);
commit;

That's what came back:
Timing is on.
BEGIN
Time: 0.198 ms
CREATE INDEX
Time: 3987.991 ms

The query is still running...

As a reminder, catalog_categoryitem should contain fewer than 1M
records.
catalog_items should contain a bit more than 600K records, where
ItemID is unique (a pk, actually).
PostgreSQL comes from the default install on Debian etch (8.1.x).
Its configuration hasn't been modified.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ivan Sergio Borgonovo (#1)
Re: eliminating records not in (select id ... so SLOW?

Ivan Sergio Borgonovo <mail@webthatworks.it> writes:

> I'm doing something like:
> delete from table1 where id not in (select id from table2).
> table1 contains ~1M record table2 contains ~ 600K record and id is
> unique.

That's going to pretty much suck unless you've got work_mem set high
enough to allow a "hashed subplan" plan --- which is likely to require
tens of MB for this case, I don't recall exactly what the per-row
overhead is. Experiment until EXPLAIN tells you it'll use a hashed
subplan.

BTW, don't bother with creating the index, it doesn't help for this.

regards, tom lane
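Tom's advice can be tried per session before running the real DELETE. A minimal sketch (the 64MB starting value is an assumption to experiment from; table and column names are taken from later in this thread):

```sql
-- Raise work_mem for this session only, then inspect the plan first.
SET work_mem = '64MB';  -- assumed starting point; experiment upward

EXPLAIN
DELETE FROM catalog_categoryitem
WHERE ItemID NOT IN (SELECT ItemID FROM catalog_items);

-- Look for "Filter: (NOT (hashed subplan))" in the output; a plain
-- Materialize subplan instead means work_mem is still too low
-- (or the compared types aren't hashable).
```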

#5Ivan Sergio Borgonovo
mail@webthatworks.it
In reply to: Tom Lane (#4)
Re: eliminating records not in (select id ... so SLOW?

On Thu, 31 Jul 2008 21:37:39 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
>> I'm doing something like:
>> delete from table1 where id not in (select id from table2).
>> table1 contains ~1M record table2 contains ~ 600K record and id
>> is unique.
>
> That's going to pretty much suck unless you've got work_mem set
> high enough to allow a "hashed subplan" plan --- which is likely
> to require tens of MB for this case, I don't recall exactly what

Thanks.

> the per-row overhead is. Experiment until EXPLAIN tells you it'll
> use a hashed subplan.

explain delete from catalog_categoryitem where ItemID not in (select
ItemID from catalog_items);

Well I reached 3Gb of work_mem and still I got:

"Seq Scan on catalog_categoryitem (cost=31747.84..4019284477.13
rows=475532 width=6)"
" Filter: (NOT (subplan))"
" SubPlan"
" -> Materialize (cost=31747.84..38509.51 rows=676167 width=8)"
" -> Seq Scan on catalog_items (cost=0.00..31071.67
rows=676167 width=8)"

I also have this:
alter table catalog_items cluster on catalog_items_pkey;
Should I drop it?

This is just a dev box. I loaded the two tables with two incoherent
sets of data just to play with, before adding all the pk/fk
constraints I need. I could just truncate the tables and reload them
from coherent sources.

But what if I *really* had to execute that query?
Any other magic I could play to speed it up?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

#6Daniel Verite
daniel@manitou-mail.org
In reply to: Ivan Sergio Borgonovo (#5)
Re: eliminating records not in (select id ... so SLOW?

Ivan Sergio Borgonovo wrote:

But what if I *really* had to execute that query?
Any other magic I could play to speed it up?

A trick that is sometimes spectacularly efficient is to rewrite the
query to use an outer join instead of NOT IN.

Try:

DELETE FROM table1 WHERE id IN
(SELECT table1.id FROM table1 LEFT JOIN table2 ON table1.id=table2.id
WHERE table2.id IS NULL)

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage:
http://www.manitou-mail.org
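Daniel's outer-join form can also be previewed read-only before deleting anything. A sketch using the actual table and column names from this thread:

```sql
-- Count the rows the DELETE would remove, using the anti-join form.
SELECT count(*)
FROM catalog_categoryitem c
LEFT JOIN catalog_items i ON c.ItemID = i.ItemID
WHERE i.ItemID IS NULL;
```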

#7Lennin Caro
lennin.caro@yahoo.com
In reply to: Ivan Sergio Borgonovo (#3)
Re: eliminating records not in (select id ... so SLOW?

OK, try this:

delete from catalog_categoryitem
where not exists (select 1 from catalog_items
                  where catalog_items.ItemID = catalog_categoryitem.ItemID);


#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ivan Sergio Borgonovo (#5)
Re: eliminating records not in (select id ... so SLOW?

Ivan Sergio Borgonovo <mail@webthatworks.it> writes:

> Well I reached 3Gb of work_mem and still I got:
>
> "Seq Scan on catalog_categoryitem  (cost=31747.84..4019284477.13
> rows=475532 width=6)"
> "  Filter: (NOT (subplan))"
> "  SubPlan"
> "    ->  Materialize  (cost=31747.84..38509.51 rows=676167 width=8)"
> "          ->  Seq Scan on catalog_items  (cost=0.00..31071.67
> rows=676167 width=8)"

Huh. The only way I can see for that to happen is if the datatypes
involved aren't hashable. What's the datatypes of the two columns
being compared, anyway?

regards, tom lane

#9Ivan Sergio Borgonovo
mail@webthatworks.it
In reply to: Tom Lane (#8)
Re: eliminating records not in (select id ... so SLOW?

On Fri, 01 Aug 2008 10:33:59 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
>> Well I reached 3Gb of work_mem and still I got:
>> "Seq Scan on catalog_categoryitem  (cost=31747.84..4019284477.13
>> rows=475532 width=6)"
>> "  Filter: (NOT (subplan))"
>> "  SubPlan"
>> "    ->  Materialize  (cost=31747.84..38509.51 rows=676167 width=8)"
>> "          ->  Seq Scan on catalog_items  (cost=0.00..31071.67
>> rows=676167 width=8)"
>
> Huh. The only way I can see for that to happen is if the datatypes
> involved aren't hashable. What's the datatypes of the two columns
> being compared, anyway?

That S in CS should mean sober!

Thanks to svn, I'd say you're right... one column was int, the other
bigint.
Among other things, I was just fixing that kind of mistake.

If that is the reason, I'll report whether things get better once I
finish normalising the DB.

BTW, does pg 8.3 save you from this kind of mistake by being stricter
with automatic casts?

Tom sorry for sending this just to your personal email.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ivan Sergio Borgonovo (#9)
Re: eliminating records not in (select id ... so SLOW?

Ivan Sergio Borgonovo <mail@webthatworks.it> writes:

> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Huh. The only way I can see for that to happen is if the datatypes
>> involved aren't hashable. What's the datatypes of the two columns
>> being compared, anyway?
>
> thanks to svn I'd say you're right... one column was int the other
> bigint.

Ah. 8.3 can hash certain cross-type comparisons (including that one)
but prior versions won't.

regards, tom lane
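On pre-8.3 servers, then, the practical fix is to give the two compared columns the same type. A sketch for checking and aligning them (table names from this thread; the choice of bigint matches what Ivan reports doing below):

```sql
-- Inspect the declared types of the compared columns.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE lower(column_name) = 'itemid'
  AND table_name IN ('catalog_categoryitem', 'catalog_items');

-- Align the types so the NOT IN subplan can be hashed.
ALTER TABLE catalog_categoryitem ALTER COLUMN ItemID TYPE bigint;
```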

#11Ivan Sergio Borgonovo
mail@webthatworks.it
In reply to: Tom Lane (#8)
Re: eliminating records not in (select id ... so SLOW?

On Fri, 01 Aug 2008 10:33:59 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
>> Well I reached 3Gb of work_mem and still I got:
>> "Seq Scan on catalog_categoryitem  (cost=31747.84..4019284477.13
>> rows=475532 width=6)"
>> "  Filter: (NOT (subplan))"
>> "  SubPlan"
>> "    ->  Materialize  (cost=31747.84..38509.51 rows=676167 width=8)"
>> "          ->  Seq Scan on catalog_items  (cost=0.00..31071.67
>> rows=676167 width=8)"
>
> Huh. The only way I can see for that to happen is if the datatypes
> involved aren't hashable. What's the datatypes of the two columns
> being compared, anyway?

I changed both columns to bigint.
I added two indexes on the ItemID column of both tables and increased
work_mem to 3Gb [sic].
The query got executed in ~1300ms... but EXPLAIN gave the same output
as the one above.

The problem is solved... but curious minds want to know.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it