50K record DELETE Begins, 100% CPU, Never Completes 1 hour later

Started by Clay Lutherover 22 years ago6 messagesgeneral

claycle@cisco.com

over 22 years ago

Again, we have an odd performance problem with PGSQL, 7.4b2.

Here is the query:

delete from numplan where pkid in
(select numplan.pkid from numplan
left outer join pilothuntgroup on numplan.pkid=pilothuntgroup.fknumplan
left outer join devicenumplanmap on numplan.pkid = devicenumplanmap.fknumplan
where numplan.tkpatternusage=2
and pilothuntgroup.fknumplan is null
and devicenumplanmap.fknumplan is null);

The query starts, PGSQL shoots to 134MB(!) of memory and 100% CPU and never completes.

The query works fine on smaller datasets. This occurs when 50K+ records exist in the numplan table.

Here is the query plan:

ccm=# explain delete from numplan where pkid in (select numplan.pkid from numplan left outer join pilothuntgroup on numplan.pkid=pilothuntgroup.fknumplan left outer join devicenumplanmap on numplan.pkid = devicenumplanmap.fknumplan where numplan.tkpatternusage=2 and pilothuntgroup.fknumplan is null and devicenumplanmap.fknumplan is null);
QUERY PLAN
------------------------------------------------------------------------------------------------
Merge IN Join (cost=37947.25..40851.71 rows=82225 width=6)
Merge Cond: ("outer"."?column3?" = ("inner".pkid)::text)
-> Sort (cost=11481.65..11687.35 rows=82279 width=46)
Sort Key: (public.numplan.pkid)::text
-> Seq Scan on numplan (cost=0.00..2936.79 rows=82279 width=46)
-> Materialize (cost=26465.60..27930.85 rows=82225 width=40)
-> Merge Left Join (cost=23917.22..25822.60 rows=82225 width=40)
Merge Cond: (("outer".pkid)::text = "inner"."?column2?")
Filter: ("inner".fknumplan IS NULL)
-> Merge Left Join (cost=11407.97..11819.13 rows=82225 width=40)
Merge Cond: ("outer"."?column2?" = "inner"."?column2?")
Filter: ("inner".fknumplan IS NULL)
-> Sort (cost=11406.89..11612.45 rows=82225 width=40)
Sort Key: (public.numplan.pkid)::text
-> Seq Scan on numplan (cost=0.00..3142.49 rows=82225 width=40)
Filter: (tkpatternusage = 2)
-> Sort (cost=1.08..1.09 rows=4 width=42)
Sort Key: (pilothuntgroup.fknumplan)::text
-> Seq Scan on pilothuntgroup (cost=0.00..1.04 rows=4 width=42)
-> Sort (cost=12509.25..12734.70 rows=90180 width=40)
Sort Key: (devicenumplanmap.fknumplan)::text
-> Seq Scan on devicenumplanmap (cost=0.00..3326.80 rows=90180 width=40)
(22 rows)

---
Clay
Cisco Systems, Inc.
claycle@cisco.com
(972) 813-5004

I've stopped 19,658 spam messages. You can too!
One month FREE spam protection at http://www.cloudmark.com/spamnetsig/}

Tom Lane

tgl@sss.pgh.pa.us

over 22 years ago

In reply to: Clay Luther (#1)

Re: 50K record DELETE Begins, 100% CPU, Never Completes 1 hour later

"Clay Luther" <claycle@cisco.com> writes:

ccm=# explain delete from numplan where pkid in (select numplan.pkid from numplan left outer join pilothuntgroup on numplan.pkid=pilothuntgroup.fknumplan left outer join devicenumplanmap on numplan.pkid = devicenumplanmap.fknumplan where numplan.tkpatternusage=2 and pilothuntgroup.fknumplan is null and devicenumplanmap.fknumplan is null);

The left join/is null thingies look like a workaround for our pre-7.4
lack of performance with NOT IN queries. Have you tried expressing
this more straightforwardly with NOT IN?

Also, what sort_mem setting are you using?

regards, tom lane

Clay Luther

claycle@cisco.com

over 22 years ago

In reply to: Tom Lane (#2)

Re: 50K record DELETE Begins, 100% CPU, Never Completes 1 hour later

Actually, I tried the NOT IN query as well, as well as writing and caching the primary keys I wanted to be sure to delete for deletion later.

Both of these options produce the same results, which lead me to believe there's a concurrancy issue somewhere, perhaps. I had combed our code for uncommitted writes, but found nothing outstanding.

I am testing now with autocommit=true.

Sort_mem is 32K.

cwl

Show quoted text

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, September 11, 2003 12:32 AM
To: Clay Luther
Cc: Pgsql-General (E-mail)
Subject: Re: [GENERAL] 50K record DELETE Begins, 100% CPU, Never
Completes 1 hour later

"Clay Luther" <claycle@cisco.com> writes:

ccm=# explain delete from numplan where pkid in (select

numplan.pkid from numplan left outer join pilothuntgroup on
numplan.pkid=pilothuntgroup.fknumplan left outer join
devicenumplanmap on numplan.pkid = devicenumplanmap.fknumplan
where numplan.tkpatternusage=2 and pilothuntgroup.fknumplan
is null and devicenumplanmap.fknumplan is null);

The left join/is null thingies look like a workaround for our pre-7.4
lack of performance with NOT IN queries. Have you tried expressing
this more straightforwardly with NOT IN?

Also, what sort_mem setting are you using?

regards, tom lane

Import Notes

Resolved by subject fallback

Tom Lane

tgl@sss.pgh.pa.us

over 22 years ago

In reply to: Clay Luther (#3)

Re: 50K record DELETE Begins, 100% CPU, Never Completes 1 hour later

"Clay Luther" <claycle@cisco.com> writes:

Sort_mem is 32K.

Try more (like 32M). Particularly in 7.4, you can really hobble a query
by starving it for sort memory (since that also determines whether
hashtable techniques will be tried).

regards, tom lane

Clay Luther

claycle@cisco.com

over 22 years ago

In reply to: Tom Lane (#4)

Re: 50K record DELETE Begins, 100% CPU, Never Completes 1 hour later

By 32K I meant:

sort_mem = 32768 # min 64, size in KB

Do you mean to say that this should be

sort_mem = 33554432

Thanks.

cwl

Show quoted text

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, September 11, 2003 3:00 PM
To: Clay Luther
Cc: Pgsql-General (E-mail)
Subject: Re: [GENERAL] 50K record DELETE Begins, 100% CPU, Never
Completes 1 hour later

"Clay Luther" <claycle@cisco.com> writes:

Sort_mem is 32K.

Try more (like 32M). Particularly in 7.4, you can really
hobble a query
by starving it for sort memory (since that also determines whether
hashtable techniques will be tried).

regards, tom lane

Import Notes

Resolved by subject fallback

Tom Lane

tgl@sss.pgh.pa.us

over 22 years ago

In reply to: Clay Luther (#5)

Re: 50K record DELETE Begins, 100% CPU, Never Completes 1 hour later

"Clay Luther" <claycle@cisco.com> writes:

By 32K I meant:
sort_mem = 32768 # min 64, size in KB

Ah, so really 32M. Okay, that is in the realm of reason. But it would
still be worth your while to investigate whether performance changes if
you kick it up some more notches. If the planner is estimating that you
would need 50M for a hash table, it will avoid hash-based plans with
this setting. (Look at estimated number of rows times estimated row
width in EXPLAIN output to get a handle on what the planner is guessing
as the data volume at each step.)

The rationale for keeping sort_mem relatively small by default is that
you may have a ton of transactions each concurrently doing one or several
sorts, and you don't want to run the system into swap hell. But if you
have one complex query to execute at a time, you should consider kicking
up sort_mem just in that session.

regards, tom lane