Performance of UPDATE SET = FROM vs UPDATE SET = (SELECT ...)

Started by Paul Jonesover 11 years ago3 messagesgeneral
Jump to latest
#1Paul Jones
pbj@cmicdo.com

Why does the UPDATE SET = FROM choose a more poorly performing plan than
the UPDATE SET = (SELECT ...)? It seems to me that it is the same join.

I'm using 9.3.5.

CREATE TABLE orig
(
key1 VARCHAR(11) PRIMARY KEY,
time1 TIME
);

INSERT INTO orig (key1, time1)
SELECT
a::TEXT,
(((random()*100)::INT % 24)::TEXT || ':' ||
((random()*100)::INT % 60)::TEXT)::TIME
FROM generate_series(80000000000, 80002000000) a;

CREATE INDEX odx ON orig(key1);

CREATE TABLE second (LIKE orig);

INSERT INTO second (key1)
SELECT (80000000000+(((random()*1000000)::INT) % 1000000))::TEXT
FROM generate_series(1,400000);

EXPLAIN ANALYZE
UPDATE second SET time1 = orig.time1
FROM orig
WHERE second.key1 = orig.key1;

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Update on second (cost=69461.02..106082.02 rows=400000 width=32) (actual time=16033.023..16033.023 rows=0 loops=1)
-> Hash Join (cost=69461.02..106082.02 rows=400000 width=32) (actual time=7698.445..12992.039 rows=400000 loops=1)
Hash Cond: ((second.key1)::text = (orig.key1)::text)
-> Seq Scan on second (cost=0.00..12627.00 rows=400000 width=18) (actual time=49.820..791.397 rows=400000 loops=1)
-> Hash (cost=31765.01..31765.01 rows=2000001 width=26) (actual time=7648.540..7648.540 rows=2000001 loops=1)
Buckets: 4096 Batches: 128 Memory Usage: 717kB
-> Seq Scan on orig (cost=0.00..31765.01 rows=2000001 width=26) (actual time=0.014..3655.844 rows=2000001 loops=1)
Total runtime: 16033.193 ms
(8 rows)

UPDATE second SET time1 = NULL;

EXPLAIN ANALYZE
UPDATE second SET time1 = (SELECT orig.time1 FROM orig,second
WHERE orig.key1 = second.key1 LIMIT 1);

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Update on second (cost=3.60..19078.19 rows=1279959 width=18) (actual time=4642.453..4642.453 rows=0 loops=1)
InitPlan 1 (returns $1)
-> Limit (cost=0.43..3.60 rows=1 width=8) (actual time=2.611..2.613 rows=1 loops=1)
-> Nested Loop (cost=0.43..4056331.83 rows=1279959 width=8) (actual time=2.606..2.606 rows=1 loops=1)
-> Seq Scan on second second_1 (cost=0.00..19074.59 rows=1279959 width=12) (actual time=2.487..2.487 rows=1 loops=1)
-> Index Scan using odx on orig (cost=0.43..3.14 rows=1 width=20) (actual time=0.098..0.098 rows=1 loops=1)
Index Cond: ((key1)::text = (second_1.key1)::text)
-> Seq Scan on second (cost=0.00..19074.59 rows=1279959 width=18) (actual time=6.420..817.739 rows=400000 loops=1)
Total runtime: 4642.561 ms
(9 rows)

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Shaun Thomas
sthomas@optionshouse.com
In reply to: Paul Jones (#1)
Re: Performance of UPDATE SET = FROM vs UPDATE SET = (SELECT ...)

These updates aren't equivalent. It's very important you know this, because you're also inflating your table with a lot of extra updated rows.

Take the first UPDATE:

UPDATE second SET time1 = orig.time1
FROM orig
WHERE second.key1 = orig.key1;

If you wrote this as a SELECT, it would look like this:

SELECT second.time1, orig.time1
FROM second
JOIN ORIG ON (second.key1 = orig.key1)

Since second is a many to one subset of orig, you now have several simultaneous updates. Your second UPDATE:

UPDATE second SET time1 = (SELECT orig.time1 FROM orig,second
WHERE orig.key1 = second.key1 LIMIT 1);

Is equivalent to this SELECT:

SELECT second.time1,
(SELECT orig.time1 FROM orig,second
WHERE orig.key1 = second.key1 LIMIT 1)
FROM second;

Meaning you'd only get as many updates as there are rows in second. The difference is your LIMIT 1. However, since you're not using an ORDER BY clause, the actual value you get for time1 will be indeterminate. Something like this would remove the row inflation and fix the random time1 behavior, but I'm not sure it was your intent:

UPDATE second
SET time1 = orig.time1
FROM (SELECT DISTINCT ON (key1) key1, time1
FROM orig
ORDER BY key1, time1 DESC) sub
WHERE second.key1 = sub.key1;

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email

#3Paul Jones
pbj@cmicdo.com
In reply to: Shaun Thomas (#2)
Re: Performance of UPDATE SET = FROM vs UPDATE SET = (SELECT ...)

On Thu, Nov 06, 2014 at 02:55:20PM +0000, Shaun Thomas wrote:

These updates aren't equivalent. It's very important you know this, because you're also inflating your table with a lot of extra updated rows.

Take the first UPDATE:

UPDATE second SET time1 = orig.time1
FROM orig
WHERE second.key1 = orig.key1;

If you wrote this as a SELECT, it would look like this:

SELECT second.time1, orig.time1
FROM second
JOIN ORIG ON (second.key1 = orig.key1)

Since second is a many to one subset of orig, you now have several simultaneous updates. Your second UPDATE:

UPDATE second SET time1 = (SELECT orig.time1 FROM orig,second
WHERE orig.key1 = second.key1 LIMIT 1);

Is equivalent to this SELECT:

SELECT second.time1,
(SELECT orig.time1 FROM orig,second
WHERE orig.key1 = second.key1 LIMIT 1)
FROM second;

Meaning you'd only get as many updates as there are rows in second. The difference is your LIMIT 1. However, since you're not using an ORDER BY clause, the actual value you get for time1 will be indeterminate. Something like this would remove the row inflation and fix the random time1 behavior, but I'm not sure it was your intent:

UPDATE second
SET time1 = orig.time1
FROM (SELECT DISTINCT ON (key1) key1, time1
FROM orig
ORDER BY key1, time1 DESC) sub
WHERE second.key1 = sub.key1;

I see now that I made more than one mistake.

1) I forgot to INCLUDE INDEXES when creating second. I would have
seen dup keys when filling it.

CREATE TABLE second (LIKE orig INCLUDING INDEXES);

2) I should have used something like this to fill second:

INSERT INTO second (key1)
SELECT key1 FROM orig
ORDER BY random()
LIMIT 400000;

3) I then incorrectly remembered the query I had written at work. It
should have been:

EXPLAIN ANALYZE
UPDATE second se SET time1 = (SELECT time1 FROM orig
WHERE orig.key1 = se.key1);

Once the second table is filled with unique keys, then both UPDATES
should have produced the same results, but the UPDATE FROM is faster
than the UPDATE = SELECT, which is documented.

My original intent was to find out what the performance differences
between the two are.

Thanks for pointing these things out!
PJ

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general