Update Performance with nested loop vs. merge join

Started by Nathan Boley over 19 years ago · 2 messages · general
#1Nathan Boley
nate.boley@gmail.com

Hi,

I just created a new column on a relatively large table (~2 million rows, over
400 MB in table size) and am trying to populate it from another, much smaller
table using an update query.
The purpose of the query is to set a bool flag if the user for the action on
table1 is the last user, as determined from table2.

The large table has a two-column primary key, and one of those columns is the
primary key of the smaller table.

My original update query is as follows:

update table1
set lastuser = (table1.user = table2.lastuser)::bool from table2
where table1.pkeycolumn1 = table2.pkey

which has an explain output of

Merge Join  (cost=883182.60..1024622.03 rows=6873573 width=89)
  Merge Cond: ("outer".pkeycolumn1 = "inner".pkey)
  ->  Sort  (cost=93387.50..94711.01 rows=529404 width=18)
        Sort Key: table2.pkey
        ->  Seq Scan on table2  (cost=0.00..15037.04 rows=529404 width=18)
  ->  Sort  (cost=789795.10..795605.39 rows=2324116 width=81)
        Sort Key: table1.pkeycolumn1
        ->  Seq Scan on table1  (cost=0.00..71900.16 rows=2324116 width=81)

This seems like it *should* be the best plan to me, but when I try to run it,
it takes forever.
However, when I add a subselect clause at the end to force the planner to use
a nested-loop select on table2, the query runs MUCH faster (1 hour instead of
> 10; I never let the first one finish).

Here is the query:

update table1
set lastuser = (table1.user = table2.lastuser)::bool from table2
where table1.pkeycolumn1 = table2.pkey
and table1.pkeycolumn1 in ( select pkey from table3 limit 1000000 )

where table1.pkeycolumn1 is a foreign key referencing table3.pkey. Also, note
that the number of rows in table3 is *much* less than 1000000.
The limit clause is necessary to force the planner into using the nested loop.
(Aside: Is there a better way to do this inside a query?)
Here is the explain for the above query:

Hash Join  (cost=13863.09..109298.79 rows=51388 width=89)
  Hash Cond: ("outer".pkeycolumn1 = "inner".pkey)
  ->  Seq Scan on table1  (cost=0.00..71900.16 rows=2324116 width=81)
  ->  Hash  (cost=13854.99..13854.99 rows=1621 width=26)
        ->  Nested Loop  (cost=8205.72..13854.99 rows=1621 width=26)
              ->  HashAggregate  (cost=8205.72..8207.72 rows=200 width=8)
                    ->  Limit  (cost=0.00..5891.43 rows=185143 width=8)
                          ->  Seq Scan on table3  (cost=0.00..5891.43 rows=185143 width=8)
              ->  Index Scan using "table2-pkey-index" on table2  (cost=0.00..28.14 rows=8 width=18)
                    Index Cond: ("outer".pkey = table2.pkey)
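[Editor's note on the aside above: instead of the LIMIT hack, PostgreSQL's planner can usually be steered with the session-level enable_* settings. A minimal sketch, using the standard enable_mergejoin GUC (a suggestion, not what the poster ran):]

```sql
SET enable_mergejoin = off;  -- discourage merge-join plans for this session

UPDATE table1
SET lastuser = (table1.user = table2.lastuser)::bool
FROM table2
WHERE table1.pkeycolumn1 = table2.pkey;

RESET enable_mergejoin;      -- restore default planner behavior
```

The setting only disfavors merge joins for cost estimation; the planner will still use one if no other join strategy is possible.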

My only theory so far is that it has something to do with how the table data
is being cached. I notice that when I use plan 1, my computer goes through
long periods of I/O with bursts of processor activity every minute or so.
When I use plan 2, I/O sits at about 90% of my total resource usage while
processor usage sits at about 10%. Maybe it keeps trying to cache and re-sort
table2? I don't really have any idea, but that is my only guess.
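[Editor's note: one way to see where the time actually goes is EXPLAIN ANALYZE, which reports real per-node timings. A minimal sketch; since it executes the UPDATE for real, wrap it in a transaction and roll back:]

```sql
BEGIN;

EXPLAIN ANALYZE
UPDATE table1
SET lastuser = (table1.user = table2.lastuser)::bool
FROM table2
WHERE table1.pkeycolumn1 = table2.pkey;

ROLLBACK;  -- discard the changes; we only wanted the timing output
```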

If anyone knows why this may be happening, I would really appreciate it.

Thanks,
Nathan

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Boley (#1)
Re: Update Performance with nested loop vs. merge join

"Nathan Boley" <nate.boley@gmail.com> writes:

Merge Join  (cost=883182.60..1024622.03 rows=6873573 width=89)
  Merge Cond: ("outer".pkeycolumn1 = "inner".pkey)
  ->  Sort  (cost=93387.50..94711.01 rows=529404 width=18)
        Sort Key: table2.pkey
        ->  Seq Scan on table2  (cost=0.00..15037.04 rows=529404 width=18)
  ->  Sort  (cost=789795.10..795605.39 rows=2324116 width=81)
        Sort Key: table1.pkeycolumn1
        ->  Seq Scan on table1  (cost=0.00..71900.16 rows=2324116 width=81)

This seems like it *should* be the best plan to me, but when I try to run it,
it takes forever.

What PG version is this, and what have you got work_mem (or sort_mem)
set to? I'm betting the sorts are slow ...
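[Editor's note: a sort that spills to disk because the per-sort memory budget is too small can be far slower than an in-memory sort. A minimal sketch of checking and raising it for the session; the 256MB figure is purely illustrative:]

```sql
SHOW work_mem;           -- current per-sort memory budget (sort_mem on old versions)

SET work_mem = '256MB';  -- session-local; choose a value your RAM can afford
-- then re-run the UPDATE and compare the runtimes
```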

regards, tom lane