Conditional query plans.
Hi.
This whole message might be a giant brain fart, but I had an
interesting idea today.
I was confronted by an obscene query plan. I have a table of logins
that shows when webmail accounts were created. A spammer went and set
up 20 or so spam accounts, so I pulled a list of them by his IP and
the time he set them up. To batch-cancel them I hacked up a quick
query:
update users
set enabled='f',
    disablereason='We do not allow our system to be used for SPAM.'
where id in (select id from users
             where loginid in (select distinct loginid from logins
                               where ip='123.123.12.12'));
This is a horrible way to do it and the query plan is even worse:
NOTICE:  QUERY PLAN:

Seq Scan on users  (cost=0.00..612996782699.54 rows=18180 width=172)
  SubPlan
    ->  Materialize  (cost=33718194.83..33718194.83 rows=18180 width=4)
          ->  Seq Scan on users  (cost=0.00..33718194.83 rows=18180 width=4)
                SubPlan
                  ->  Materialize  (cost=1854.65..1854.65 rows=48 width=12)
                        ->  Unique  (cost=1853.44..1854.65 rows=48 width=12)
                              ->  Sort  (cost=1853.44..1853.44 rows=482 width=12)
                                    ->  Index Scan using logins_ip_idx on logins  (cost=0.00..1831.97 rows=482 width=12)
Given that the first and second subplans actually return only 25 rows,
there are two possible distillations of this plan:
update users
set enabled='f',
    disablereason='We do not allow our system to be used for SPAM.'
where id in (27082,27083,27084,27085,27086,27087,27088,27089,27090,
             27091,27092,27097,27098,27099,27101,27102,27103,27104,
             27094,27096,27095,27106,27100,27105,27093);
Which comes up with a plan:
NOTICE:  QUERY PLAN:

Index Scan using users_pkey, users_pkey, users_pkey, users_pkey, users_pkey,
    users_pkey, users_pkey, users_pkey, users_pkey, users_pkey, users_pkey,
    users_pkey, users_pkey, users_pkey, users_pkey, users_pkey, users_pkey,
    users_pkey, users_pkey, users_pkey, users_pkey, users_pkey, users_pkey,
    users_pkey, users_pkey on users  (cost=0.00..57.04 rows=2 width=172)
Basically it's going through each of the 25 IDs as though they were
separate updates.
The second and probably less optimal plan would be to create a hash
of these 25 answers and do a sequential scan on users, updating rows
where id is found in that hash. Of these two query plans, one would
be optimal when there is a small list to update, and the other would
be ideal when the list is large.
Why attempt to formulate a complete query plan at the outset? Could
you not break the query into smaller parts and re-optimize after
every subplan completes? That way the subplans would provide exact
row counts, so more accurate choices could be made farther down the
line. This becomes especially relevant for large joins and other
complex queries.
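(A crude way to approximate this by hand is to materialize the inner
result into a temporary table and run the outer statement as a second
step, so it is planned only after the inner result exists. A sketch,
using PostgreSQL's SELECT INTO TEMP; the table name spam_logins is
made up for illustration:)

-- Step 1: run and materialize the inner query by itself.
select distinct loginid into temp spam_logins
from logins where ip = '123.123.12.12';

-- Step 2: the outer update now references a small, already-computed
-- table instead of a correlated subplan over users.
update users
set enabled = 'f',
    disablereason = 'We do not allow our system to be used for SPAM.'
where loginid in (select loginid from spam_logins);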
Maybe I just gave away an idea I could have sold to Oracle for
millions, and maybe everyone is already doing this. Anyway, these are
just thoughts, and if anyone makes it this far it might be worthwhile
for a little discussion.
-Michael
From: "Kevin O'Gorman" <kogorman@pacbell.net>
Date: Thu, 19 Oct 2000 16:12:00 -0700
Subject: Rule system goes weird with SELECT queries
I must admit I'm trying to (ab)use the rule system into
being a stored-procedure system, so maybe I'm just getting
what I deserve. However, the results I'm getting are
just plain weird.
If I define two rules for the same action, each with a single SELECT
command, I wind up with two SELECTs as expected, but each of them is
a cross-product select over the two tables. That part is unexpected.
If I change the grammar rules so that I can have a compound action
with two SELECTs, I get two SELECTs, each effectively a four-way
cross-product of the tables involved. Talk about exponential growth!
Now I can see why compound SELECTs were disallowed.
And I can guess why my two separate rules behaved this
way, sort of. But if I'm right, the rules are being
processed by the planner once on creation and again when
being invoked, and something is not quite right about
it.
But: does anyone else see a need for a stored-procedure facility,
distinct from function definition? I'm probably going to do it
anyway, but if there's support for the idea, I will try to make it
conform to the standards of the community, in return for a little
guidance on that subject.
Here are the details (all tables initially empty):
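(The table definitions are not shown here; a minimal hypothetical
setup consistent with the plans below would be something like:)

create table dummy (a int4);  -- hypothetical: the rules' target table
create table d2 (x int4);     -- hypothetical: selected by rule4a
create table d3 (y int4);     -- hypothetical: selected by rule4b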
Form 1: two separate rules give two cross-products.
create rule rule4a as on insert to dummy do instead select * from d2;
create rule rule4b as on insert to dummy do instead select * from d3;
explain insert into dummy values(1);
psql:rule4.sql:14: NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..30020.00 rows=1000000 width=8)
  ->  Seq Scan on d3  (cost=0.00..20.00 rows=1000 width=4)
  ->  Seq Scan on d2  (cost=0.00..20.00 rows=1000 width=4)

psql:rule4.sql:14: NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..30020.00 rows=1000000 width=8)
  ->  Seq Scan on d2  (cost=0.00..20.00 rows=1000 width=4)
  ->  Seq Scan on d3  (cost=0.00..20.00 rows=1000 width=4)

EXPLAIN
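(Judging purely from the plan shape, each rule's single-table SELECT
has been rewritten into something like a two-table cross-product:)

select * from d2, d3;  -- roughly what each rewritten rule action plans as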
Form 2: a single rule with two SELECT commands gives something quite
weird: apparently a quadruple cross-product, performed twice:
create rule rule3 as on insert to dummy do instead (select * from d2;
select * from d3;);
explain insert into dummy values(1);
psql:rule3.sql:13: NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..30030030020.00 rows=1000000000000 width=16)
  ->  Nested Loop  (cost=0.00..30030020.00 rows=1000000000 width=12)
        ->  Nested Loop  (cost=0.00..30020.00 rows=1000000 width=8)
              ->  Seq Scan on d2  (cost=0.00..20.00 rows=1000 width=4)
              ->  Seq Scan on d3  (cost=0.00..20.00 rows=1000 width=4)
        ->  Seq Scan on d3  (cost=0.00..20.00 rows=1000 width=4)
  ->  Seq Scan on d2  (cost=0.00..20.00 rows=1000 width=4)

psql:rule3.sql:13: NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..30030030020.00 rows=1000000000000 width=16)
  ->  Nested Loop  (cost=0.00..30030020.00 rows=1000000000 width=12)
        ->  Nested Loop  (cost=0.00..30020.00 rows=1000000 width=8)
              ->  Seq Scan on d2  (cost=0.00..20.00 rows=1000 width=4)
              ->  Seq Scan on d3  (cost=0.00..20.00 rows=1000 width=4)
        ->  Seq Scan on d3  (cost=0.00..20.00 rows=1000 width=4)
  ->  Seq Scan on d2  (cost=0.00..20.00 rows=1000 width=4)

EXPLAIN
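(Again judging from the plan shape, each command now plans as roughly
a four-way product; note the row estimate of 1000^4:)

select * from d2 a, d3 b, d3 c, d2 d;  -- aliases needed to repeat a table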
--
Kevin O'Gorman (805) 650-6274 mailto:kogorman@pacbell.net
Permanent e-mail forwarder: mailto:Kevin.O'Gorman.64@Alum.Dartmouth.org
At school: mailto:kogorman@cs.ucsb.edu
Web: http://www.cs.ucsb.edu/~kogorman/index.html
Web: http://trixie.kosman.via.ayuda.com/~kevin/index.html
"There is a freedom lying beyond circumstance,
derived from the direct intuition that life can
be grounded upon its absorption in what is
changeless amid change"
-- Alfred North Whitehead
"Michael Richards" <michael@fastmail.ca> writes:
The second and probably less optimal plan would be to create a hash
of these 25 answers and do a sequential scan on users updating rows
where id is found in that hash.
Given the presence of the "materialize" nodes, I don't think this query
plan is quite as nonoptimal as you think, especially for ~25 rows out of
the subplan. It's a linear search over a 25-entry table for each outer
row, but so what? With hundreds or thousands of rows out of the
subquery, it'd be nice to have a smarter table lookup method, agreed,
but here it hardly matters.
Something that's been on the todo list for a long time is to try to
convert WHERE foo IN (SELECT ...) queries into some kind of join,
instead of a subselect. With that approach we'd be able to use merge
or hash strategies to match up inner and outer rows, which'd work a lot
better when there are large numbers of rows involved. It might actually
happen for 7.2...
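(For illustration, that kind of rewrite can already be done by hand
with PostgreSQL's UPDATE ... FROM extension; this is a sketch against
Michael's tables, not the transformation the planner itself would
perform:)

update users
set enabled = 'f',
    disablereason = 'We do not allow our system to be used for SPAM.'
from logins
where users.loginid = logins.loginid
  and logins.ip = '123.123.12.12';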
regards, tom lane
How does PostgreSQL handle a "too big" transaction?
By that I mean a transaction which, past a certain point, can no
longer be rolled back. With PostgreSQL, maybe that only happens when
the disk fills. Is there a configurable size limit for a single
transaction?
In addition, what happens if the disk fills up? PostgreSQL is able to
roll back, right?
I'm assuming you can prevent the disk from actually filling up (and
crashing the whole server) by turning on quotas for the postgres
superuser, so that only pgsql would complain. Please correct me if
I'm wrong.
> update users set enabled='f',disablereason='We do not allow our
> system to be used for SPAM.' where id in (select id from users where
> loginid in (select distinct loginid from logins where
> ip='123.123.12.12'));
Would it run better as:

update users
set enabled='f',
    disablereason='We do not allow our system to be used for SPAM.'
where loginid in (select distinct loginid from logins
                  where ip='123.123.12.12');
Or perhaps even:

update users
set enabled='f',
    disablereason='We do not allow our system to be used for SPAM.'
where id in (select distinct users.id from users, logins
             where users.loginid = logins.loginid
               and logins.ip='123.123.12.12');

I don't know if that helps the query plan, but it looks prettier :)