IN vs EXISTS equivalence

Started by Kevin Grittnerover 18 years ago25 messageshackers
Jump to latest
#1Kevin Grittner
Kevin.Grittner@wicourts.gov

I've requested this before without response, but I'm asking again
because it just caused me pain again: could we get a TODO added to
have the planner recognize equivalent IN and EXISTS constructs and
have them compete on cost estimates? I know it's not a trivial
improvement, but if it's on the list maybe someone will pick it up,
and I see it as the single biggest weakness in PostgreSQL
performance.

I don't need help resolving this particular case, because the fix is
always blinding obvious when we hit this, and it doesn't even break
portability because no other database we've tested fails to recognize
these equivalent cases.

step=# explain DELETE FROM "Body" WHERE "bodySeqNo" NOT IN (SELECT "bodySeqNo" FROM "Message");
QUERY PLAN
----------------------------------------------------------------------------------
Seq Scan on "Body" (cost=90277.43..285235351699.39 rows=3313379 width=6)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=90277.43..159793.40 rows=6627957 width=11)
-> Seq Scan on "Message" (cost=0.00..80413.07 rows=6627957 width=11)
(5 rows)

step=# explain DELETE FROM "Body" WHERE NOT EXISTS (SELECT * FROM "Message" m WHERE m."bodySeqNo" = "Body"."bodySeqNo");
QUERY PLAN
--------------------------------------------------------------------------------------------
Seq Scan on "Body" (cost=0.00..3401760.88 rows=3313416 width=6)
Filter: (NOT (subplan))
SubPlan
-> Index Scan using "Message_Body" on "Message" m (cost=0.00..0.49 rows=1 width=136)
Index Cond: (("bodySeqNo")::numeric = ($0)::numeric)
(5 rows)

The bodySeqNo column is NOT NULL in both tables, and is the primary
key in the Body table. The Message table has a non-unique index on
it. (\d lists will follow at the bottom.)

I cancelled the first query after it had been running for 54 hours
over our slowest hours (the weekend). The second form ran in four
minutes in competition with peak time queries.

-Kevin

step=# \d "Body"
Table "public.Body"
Column | Type | Modifiers
-------------+------------------------+-----------
bodySeqNo | "SequenceT" | not null
contentType | character varying(255) | not null
encoding | character varying(255) |
body | "BodyT" |
Indexes:
"Body_pkey" PRIMARY KEY, btree ("bodySeqNo")

step=# \d "Message"
Table "public.Message"
Column | Type | Modifiers
-----------------+--------------------------+-----------
messageId | "SequenceT" | not null
clientMessageId | "ClientMessageIdT" | not null
correlationId | "SequenceT" |
destQueue | "QueueNameT" | not null
replyToQueue | "QueueNameT" | not null
typeCode | character(2) |
expiration | timestamp with time zone |
priority | smallint | not null
status | character(2) | not null
created | timestamp with time zone | not null
lastModified | timestamp with time zone | not null
bodySeqNo | "SequenceT" | not null
messageIdSearch | "PrioritySequenceT" | not null
Indexes:
"Message_pkey" PRIMARY KEY, btree ("messageId")
"MessageIndex2" UNIQUE, btree ("destQueue", "clientMessageId")
"Message_MessageIdSearch" UNIQUE, btree ("destQueue", status, "messageIdSearch") CLUSTER
"Message_Body" btree ("bodySeqNo")
"Message_Created" btree ("destQueue", status, created)
"Message_Created2" btree ("destQueue", created)
"Message_Expiration" btree (expiration)
"Message_LastModified" btree ("destQueue", "lastModified")
"Message_ReplyToQueue" btree ("replyToQueue")
Foreign-key constraints:
"Message_fk1" FOREIGN KEY ("destQueue") REFERENCES "Queue"(name)
"Message_fk2" FOREIGN KEY ("replyToQueue") REFERENCES "Queue"(name)

#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Kevin Grittner (#1)
Re: IN vs EXISTS equivalence

On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote:

I've requested this before without response, but I'm asking again
because it just caused me pain again: could we get a TODO added to
have the planner recognize equivalent IN and EXISTS constructs and
have them compete on cost estimates? I know it's not a trivial
improvement, but if it's on the list maybe someone will pick it up,
and I see it as the single biggest weakness in PostgreSQL
performance.

I'll pick it up as a default unless someone requests they have it from
me.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#3Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Simon Riggs (#2)
Re: IN vs EXISTS equivalence

On Mon, Oct 22, 2007 at 1:30 PM, in message

<1193077831.4319.61.camel@ebony.site>, Simon Riggs <simon@2ndquadrant.com>
wrote:

On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote:

I've requested this before without response, but I'm asking again
because it just caused me pain again: could we get a TODO added to
have the planner recognize equivalent IN and EXISTS constructs and
have them compete on cost estimates? I know it's not a trivial
improvement, but if it's on the list maybe someone will pick it up,
and I see it as the single biggest weakness in PostgreSQL
performance.

I'll pick it up as a default unless someone requests they have it from
me.

Thanks, Simon.

One more logically equivalent, PostgreSQL-specific form which
costs out even better was suggested off-list:

step=# explain DELETE FROM "Body" USING "Message" WHERE "Message"."bodySeqNo" = "Body"."bodySeqNo";
QUERY PLAN
--------------------------------------------------------------------------------------------------
Merge Join (cost=0.00..696766.20 rows=4048543 width=6)
Merge Cond: (("Body"."bodySeqNo")::numeric = ("Message"."bodySeqNo")::numeric)
-> Index Scan using "Body_pkey" on "Body" (cost=0.00..326108.11 rows=4048543 width=18)
-> Index Scan using "Message_Body" on "Message" (cost=0.00..310085.16 rows=4048847 width=12)
(4 rows)

If both of the other syntaxes could compete against that, it would
be fantastic. (If that's feasible.)

-Kevin

#4Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#3)
Re: IN vs EXISTS equivalence

On Mon, Oct 22, 2007 at 4:37 PM, in message

<471CD1BE.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:

One more logically equivalent, PostgreSQL-specific form which
costs out even better was suggested off-list:

Oops. That is not logically equivalent. We want to delete WHERE NOT
EXISTS; the logic of that suggestion is backwards.

Disregard that last post, please.

-Kevin

#5Pavel Stehule
pavel.stehule@gmail.com
In reply to: Kevin Grittner (#4)
Re: IN vs EXISTS equivalence

2007/10/23, Kevin Grittner <Kevin.Grittner@wicourts.gov>:

On Mon, Oct 22, 2007 at 4:37 PM, in message

<471CD1BE.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:

One more logically equivalent, PostgreSQL-specific form which
costs out even better was suggested off-list:

Oops. That is not logically equivalent. We want to delete WHERE NOT
EXISTS; the logic of that suggestion is backwards.

Disregard that last post, please.

-Kevin

my mistake, sorry

Pavel

#6Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#4)
Re: IN vs EXISTS equivalence

On Mon, Oct 22, 2007 at 5:04 PM, in message

<471CD819.EE98.0025.0@wicourts.gov>, "Kevin Grittner"

Oops. That is not logically equivalent. We want to delete WHERE NOT
EXISTS; the logic of that suggestion is backwards.

Disregard that last post, please.

Maybe that last post shouldn't be totally disregarded -- it wouldn't
be a bad idea to support a Merge NOT IN Join if it the effort isn't
out of line with the benefit.

Pavel suggested a clever kludge to accomplish this, which costs out
better than anything else I've tried:

step=# explain DELETE FROM "Body"
step-# WHERE "bodySeqNo" IN (SELECT "Body"."bodySeqNo"
step(# FROM "Body"
step(# LEFT JOIN "Message"
step(# ON "Body"."bodySeqNo" = "Message"."bodySeqNo"
step(# WHERE "Message"."bodySeqNo" IS NULL);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Merge IN Join (cost=825315.30..1265285.81 rows=2010418 width=6)
Merge Cond: ((public."Body"."bodySeqNo")::numeric = (public."Body"."bodySeqNo")::numeric)
-> Index Scan using "Body_pkey" on "Body" (cost=0.00..383702.32 rows=4020835 width=18)
-> Materialize (cost=825315.30..846401.18 rows=2010418 width=12)
-> Merge Left Join (cost=0.00..822323.18 rows=2010418 width=12)
Merge Cond: ((public."Body"."bodySeqNo")::numeric = ("Message"."bodySeqNo")::numeric)
Filter: ("Message"."bodySeqNo" IS NULL)
-> Index Scan using "Body_pkey" on "Body" (cost=0.00..383702.32 rows=4020835 width=12)
-> Index Scan using "Message_Body" on "Message" (cost=0.00..378901.17 rows=4021733 width=12)
(9 rows)

Just some ideas to look at while you're "in the neighborhood."

-Kevin

#7Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Simon Riggs (#2)
Re: IN vs EXISTS equivalence

On Mon, Oct 22, 2007 at 1:30 PM, Simon Riggs wrote:

On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote:

I've requested this before without response, but I'm asking again
because it just caused me pain again: could we get a TODO added to
have the planner recognize equivalent IN and EXISTS constructs and
have them compete on cost estimates? I know it's not a trivial
improvement, but if it's on the list maybe someone will pick it up,
and I see it as the single biggest weakness in PostgreSQL
performance.

I'll pick it up as a default unless someone requests they have it

from

me.

Since Simon's focus has shifted to other issues, I'm hoping this can
go onto the TODO list.

I'm adding some NOT EXISTS examples to the thread for completeness of
what someone might want to address while working on it. For two
queries which can easily be shown (to a human viewer, anyway) to
return identical results, I see performance differences of over five
orders of magnitude. (Six if you compare to the LEFT JOIN ... WHERE
not_null_right_column IS NULL trick.)

Below are the cost estimates of the three techniques for a
medium-sized table joining to a large table, and for a large table
joining to a small table. The IN behavior has the worst worst-case
behavior, at least in queries that I've run, although many people
report that it is usually faster. The technique of doing an existence
test with a LEFT JOIN and then checking whether a NOT NULL column from
the right-hand table is null is often faster than either technique,
and seldom much worse than the best technique for any given test.

Queries and plans attached. Summary of costs below, in millions of
cost units. (Fractions of a million discarded.)

NOT IN (independent_subquery)
19745843, 5

WHERE NOT EXISTS
74, 318

LEFT JOIN WHERE not_null_right_column IS NULL
10, 17

These cost estimates tend to come out in pretty consistent ratio to
the actual run times.

-Kevin

Attachments:

not-exists-timings.txttext/plain; name=not-exists-timings.txtDownload
#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#7)
Re: IN vs EXISTS equivalence

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

I'm adding some NOT EXISTS examples to the thread for completeness of
what someone might want to address while working on it. For two
queries which can easily be shown (to a human viewer, anyway) to
return identical results, I see performance differences of over five
orders of magnitude.

Could we see EXPLAIN ANALYZE not just EXPLAIN for these? When people
are complaining of bad planner behavior, I don't find bare EXPLAIN
output to be very convincing.

regards, tom lane

#9Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#8)
Re: IN vs EXISTS equivalence

On Mon, Aug 4, 2008 at 6:48 PM, in message

<15026.1217893692@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

I'm adding some NOT EXISTS examples to the thread for completeness

of

what someone might want to address while working on it. For two
queries which can easily be shown (to a human viewer, anyway) to
return identical results, I see performance differences of over

five

orders of magnitude.

Could we see EXPLAIN ANALYZE not just EXPLAIN for these? When

people

are complaining of bad planner behavior, I don't find bare EXPLAIN
output to be very convincing.

I'll give it a shot. I've never had the patience to let the one with
the cost five or six orders of magnitude higher than the others run to
completion, but I've started the lot of 'em.

-Kevin

#10Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#8)
Re: IN vs EXISTS equivalence

On Mon, Aug 4, 2008 at 6:48 PM, in message

<15026.1217893692@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

I'm adding some NOT EXISTS examples to the thread for completeness

of

what someone might want to address while working on it. For two
queries which can easily be shown (to a human viewer, anyway) to
return identical results, I see performance differences of over

five

orders of magnitude.

Could we see EXPLAIN ANALYZE not just EXPLAIN for these? When

people

are complaining of bad planner behavior, I don't find bare EXPLAIN
output to be very convincing.

The other five queries have a cost to millisecond ratio of between 9.8
and 267. If the expensive one falls in the same range, it will run
for 2.3 to 64 years. I know I have left it running for days before
without completion. I don't think I can devote the resources to it.
Attached are the EXPLAIN ANALYZE output for the other five.

-Kevin

Attachments:

not-exists-timings2.outapplication/octet-stream; name=not-exists-timings2.outDownload
#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#10)
Re: IN vs EXISTS equivalence

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

I'm adding some NOT EXISTS examples to the thread for completeness
of what someone might want to address while working on it. For two
queries which can easily be shown (to a human viewer, anyway) to
return identical results, I see performance differences of over
five orders of magnitude.

I've been looking at this a bit and have an action plan worked out.

I believe that the optimizable cases for EXISTS are those where the
EXISTS() is either at the top level of WHERE, or just underneath a NOT,
and the contained subselect:

* has no set operations (UNION etc), grouping, set-returning functions
in the SELECT list, LIMIT, or a few other funny cases

* references outer-level variables only in its WHERE clause.

Given these conditions, an EXISTS is equivalent to a standard semijoin
between the outer relations used in its WHERE clause and the sub-select
with the WHERE removed, where the join condition is just the WHERE
clause. (A semijoin returns one copy of each LHS row for which there
exists at least one RHS row satisfying the join condition.)

Similarly, a NOT EXISTS is equivalent to a standard anti-semijoin.
(An anti-semijoin returns one copy of each LHS row for which there
exists no RHS row satisfying the join condition.)

So what I think we should do is implement JOIN_SEMI and JOIN_ANTI
as variant outer-join types in the planner and executor, and convert
EXISTS into those. JOIN_SEMI would replace the existing JOIN_IN support.
(It's effectively the same thing so far as the executor is concerned,
but there are some places in the planner that assume an IN join condition
consists of one or more btree equality operators. I'm envisioning folding
the current planner support for IN into the more general outer-join
planning infrastructure, and fixing it so that it doesn't assume the
join clause must be of that form.)

Explicit support for JOIN_ANTI is probably not strictly necessary:
we could represent it using the "LEFT JOIN WHERE rhs-variable IS NULL"
hack. However this only works if the join's ON-condition can be proved
strict for at least one RHS variable, which is an assumption I'd rather
not require for optimizing EXISTS. Also, the planner should be able to
make much better estimates of the size of an antijoin result if it has an
explicit concept that that's what's happening. (The existing kluge in
nulltestsel() is not only ugly but has little hope of ever giving an
accurate estimate.) So I'd prefer to go the other way: make the planner
recognize the IS NULL trick and remove the IS NULL clause in favor of
marking the LEFT JOIN as being really an antijoin.

As far as IN goes: IN at the top level of WHERE can still be optimized
to a semijoin under the same conditions as now. NOT IN is a lot trickier,
for the same reason that typically trips up novices who try to use it:
if any row of the subselect produces a NULL comparison result, then it is
impossible for the NOT IN to result in TRUE, which means that it does not
function as a standard antijoin. I thought about optimizing it only in
the case where we can prove that the subselect outputs and the compared-to
values are known NOT NULL (which in typical cases we could prove by
looking for NOT NULL constraints on those table columns). The trouble
with this is that that's not a sufficient condition: you must also assume
that the comparison operator involved never yields NULL for non-null
inputs. That might be okay for btree comparison functions but it's not a
very comfy assumption in general; we certainly haven't got any explicit
knowledge that any functions are guaranteed to act that way. So this
case might be worth doing later but I'm not feeling excited about it.
We generally tell people to avoid NOT IN and I'm happy to keep on
saying that.

Comments?

regards, tom lane

#12Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#11)
Re: IN vs EXISTS equivalence

Tom Lane <tgl@sss.pgh.pa.us> wrote:

I believe that the optimizable cases for EXISTS are those where the
EXISTS() is either at the top level of WHERE, or just underneath a

NOT,

The rest of the plan makes sense to me, but this part seems narrow.
There's probably a good reason for that which is beyond my depth, but
attached is a view that is used for calculating statistics for a
database which is primarily used for "case management" purposes. If
EXISTS could also be optimized in the contexts used there, it would be
great.

For context, this view references three other non-trivial views
(MatterHistSearch, MatterHistStage, & MatterHistStatus), and does
perform acceptably, so it's not a matter of complaining if it can't
work here, just providing a real-world example of other useful
contexts for EXISTS which might be worth covering if possible.

This view is used in a large number of queries, mostly either
selecting a single date with other selection criteria or counting rows
which match a set of matterNo values selected on complex criteria.

By the way, this view was totally unusable under 8.2. Showing how it
worked under 8.3 was all it took to get them to expedite the upgrade.
These views, possible because of improvements in 8.3, saved "countless
programmer days" (to quote one of the programmers).

-Kevin

Attachments:

MatterDateStat.txttext/plain; name=MatterDateStat.txtDownload
#13Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#11)
Re: IN vs EXISTS equivalence

On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:

* has no set operations (UNION etc), grouping, set-returning functions
in the SELECT list, LIMIT, or a few other funny cases

Couldn't union/union all be treated as

EXISTS(a)
OR EXISTS(b)
...

Or am I missing some detail with NULLS?

Personally, I'd rather write it as separate EXISTS clauses rather
than using UNION, but perhaps others have a different preference...
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#12)
Re: IN vs EXISTS equivalence

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

I believe that the optimizable cases for EXISTS are those where the
EXISTS() is either at the top level of WHERE, or just underneath a
NOT,

The rest of the plan makes sense to me, but this part seems narrow.
There's probably a good reason for that which is beyond my depth, but
attached is a view that is used for calculating statistics for a
database which is primarily used for "case management" purposes. If
EXISTS could also be optimized in the contexts used there, it would be
great.

I chewed on that for awhile. We can certainly optimize EXISTS that's
appearing in the ON-condition of a regular JOIN, because that's not
really semantically different from a WHERE-condition. But I don't think
it's going to be reasonable to improve EXISTS in outer-JOIN ON
conditions. There are a couple of problems. Consider

t1 LEFT JOIN t2
ON (t1.f1 = t2.f2 AND
EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx AND t3.f4 = t2.fy))

To implement this with the correct semantics, we'd have to have the
t1/t2 join and the t1/t2/t3 join going on in the same execution node,
with two different join behaviors (LEFT and SEMI). There isn't any
way AFAICS to factor it into two separate steps. That's unreasonably
complicated, and it's not clear that you'd get any better performance
anyway than the current implementation (which'd treat the EXISTS as
a subplan).

Even worse, let the EXISTS be a degenerate case:

t1 LEFT JOIN t2
ON (t1.f1 = t2.f2 AND
EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx));

We can't actually treat this EXISTS as a semijoin between t1 and t3,
either before or after the LEFT JOIN; because then the behavior would be
to drop t1 rows that have no t3 match, which is not what this query
specifies.

(Note: the other degenerate case, where the EXISTS depends only on
t2, *could* be optimized since we could just let the semijoin be
performed within the RHS of the LEFT JOIN.)

So this is not something I'm going to tackle; at least not this
devel cycle.

One small step we can take in this direction, though, is to improve the
planner's internal handling of the qual conditions for IN and EXISTS.
Right now the process is just to throw the sub-select into the main
range table and put the IN join conditions into the same place in WHERE
that the IN-clause was to start with. The trouble with this is that the
distribute_quals_to_rels processing has no idea that there's anything
special about the IN join conditions. We got away with that for the
limited case of IN clauses at the top level of WHERE, but it's become
clear to me over the weekend that this has no hope of working for NOT
EXISTS --- since that's effectively an outer join, it absolutely has to
have the same kinds of qual-scheduling constraints as ordinary outer
joins do. So we need a data structure that distribute_quals_to_rels can
work with. What I think needs to happen is that the initial pass that
pulls up optimizable IN/EXISTS sub-selects should not merge the
SubLink's replacement qual clauses seamlessly, but put them underneath a
new node type, say "FlattenedSubLink", that retains knowledge of the
join it's representing. The FlattenedSubLink would survive only as far
as distribute_quals_to_rels, which would distribute out the contained
qual conditions instead of the FlattenedSubLink itself --- but only
after marking them properly for the outer-join restrictions. This
representation would make it feasible to work with IN/EXISTS that are
inside JOIN ON conditions, which the present representation using a
single in_info_list really can't do. The semantic issues are still
there but at least the representation isn't getting in the way ...

regards, tom lane

#15Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#14)
Re: IN vs EXISTS equivalence

Our Internet connectivity failed as this was being sent. It looks
like at least the list didn't get it, so here goes another try.
Apologies for any duplication.

-Kevin

Tom Lane <tgl@sss.pgh.pa.us> wrote:

I chewed on that for awhile. We can certainly optimize EXISTS

that's

appearing in the ON-condition of a regular JOIN, because that's not
really semantically different from a WHERE-condition.

Good to hear. I thought that might be doable. :-)

But I don't think
it's going to be reasonable to improve EXISTS in outer-JOIN ON
conditions. There are a couple of problems. Consider

The discussion did make the difficulties clear.

So this is not something I'm going to tackle; at least not this
devel cycle.

Fair enough.

One small step we can take in this direction, though, is to improve

the

planner's internal handling of the qual conditions for IN and

EXISTS.

Right now the process is just to throw the sub-select into the main
range table and put the IN join conditions into the same place in

WHERE

that the IN-clause was to start with. The trouble with this is that

the

distribute_quals_to_rels processing has no idea that there's

anything

special about the IN join conditions. We got away with that for the
limited case of IN clauses at the top level of WHERE, but it's

become

clear to me over the weekend that this has no hope of working for

NOT

EXISTS --- since that's effectively an outer join, it absolutely has

to

have the same kinds of qual-scheduling constraints as ordinary outer
joins do. So we need a data structure that distribute_quals_to_rels

can

work with. What I think needs to happen is that the initial pass

that

pulls up optimizable IN/EXISTS sub-selects should not merge the
SubLink's replacement qual clauses seamlessly, but put them

underneath a

new node type, say "FlattenedSubLink", that retains knowledge of the
join it's representing. The FlattenedSubLink would survive only as

far

as distribute_quals_to_rels, which would distribute out the

contained

qual conditions instead of the FlattenedSubLink itself --- but only
after marking them properly for the outer-join restrictions. This
representation would make it feasible to work with IN/EXISTS that

are

inside JOIN ON conditions, which the present representation using a
single in_info_list really can't do. The semantic issues are still
there but at least the representation isn't getting in the way ...

Just curious, is that something for this cycle, or a TODO item?

Thanks for looking at this. The one part I'm not sure about is where
the CASE/EXISTS in the SELECT value list fits into this discussion.
It seems conceptually similar to the OUTER JOIN, but sort of a special
case, so I'm not sure what you had in mind there.

-Kevin

#16Bruce Momjian
bruce@momjian.us
In reply to: Jim Nasby (#13)
Re: IN vs EXISTS equivalence

"Decibel!" <decibel@decibel.org> writes:

On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:

* has no set operations (UNION etc), grouping, set-returning functions
in the SELECT list, LIMIT, or a few other funny cases

Couldn't union/union all be treated as

EXISTS(a)
OR EXISTS(b)

Kind of confused by what you mean here. Can you give an example?

The usual transformation to consider with UNION is to transform

SELECT ... WHERE x OR y

into

SELECT ...
WHERE x
UNION ALL
SELECT ...
WHERE y AND NOT x

(modulo handling NULLs properly)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#13)
Re: IN vs EXISTS equivalence

Decibel! <decibel@decibel.org> writes:

On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:

* has no set operations (UNION etc), grouping, set-returning functions
in the SELECT list, LIMIT, or a few other funny cases

Couldn't union/union all be treated as
EXISTS(a)
OR EXISTS(b)

Perhaps, but that would end up defeating the optimization anyway,
because as soon as the EXISTS is underneath an OR, it's no longer
representing a potential join clause.

regards, tom lane

#18Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#16)
Re: IN vs EXISTS equivalence

On Aug 11, 2008, at 3:40 PM, Gregory Stark wrote:

"Decibel!" <decibel@decibel.org> writes:

On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:

* has no set operations (UNION etc), grouping, set-returning
functions
in the SELECT list, LIMIT, or a few other funny cases

Couldn't union/union all be treated as

EXISTS(a)
OR EXISTS(b)

Kind of confused by what you mean here. Can you give an example?

The usual transformation to consider with UNION is to transform

SELECT ... WHERE x OR y

into

SELECT ...
WHERE x
UNION ALL
SELECT ...
WHERE y AND NOT x

It was a case of the union being inside the EXISTS subquery, ie:

WHERE EXISTS (SELECT * FROM a UNION SELECT * FROM b)

AFAIK that's identical to

WHERE EXISTS (SELECT * FROM a) OR EXISTS (SELECT * FROM b)

But as Tom mentioned, as soon as you have it you can't answer it with
a simple join, so it doesn't matter...
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#11)
Re: IN vs EXISTS equivalence

On Fri, 2008-08-08 at 16:23 -0400, Tom Lane wrote:

NOT IN is a lot trickier,
for the same reason that typically trips up novices who try to use it:
if any row of the subselect produces a NULL comparison result, then it
is impossible for the NOT IN to result in TRUE, which means that it
does not function as a standard antijoin. I thought about optimizing
it only in the case where we can prove that the subselect outputs and
the compared-to values are known NOT NULL (which in typical cases we
could prove by looking for NOT NULL constraints on those table
columns). The trouble with this is that that's not a sufficient
condition: you must also assume that the comparison operator involved
never yields NULL for non-null inputs. That might be okay for btree
comparison functions but it's not a very comfy assumption in general;
we certainly haven't got any explicit knowledge that any functions are
guaranteed to act that way. So this case might be worth doing later
but I'm not feeling excited about it. We generally tell people to
avoid NOT IN and I'm happy to keep on saying that.

Just found this comment, after reading what you said on other thread
about NOT IN.

NOT IN is a serious performance issue for most people. We simply can't
say to people "you were told not to".

If we can fix it easily for the majority of cases, we should. We can't
let the "it won't work in certain cases" reason prevent various
optimizations from going in. There are tons of places where we say "XXX
needs later improvement" in code comments. So lets do that here also. It
certainly wouldn't be the first optimization/feature that went into code
in a restricted way that didn't work for all cases: hash joins, ANALYZE,
partial indexes etc..

Anybody that is writing complex SQL with user defined operators knows
enough to re-write their queries correctly, so there will be almost no
negative effect from making the NOT IN optimisation a special case. And
if there is an effect, the people effected can fix the problem.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#15)
Re: IN vs EXISTS equivalence

If you're still interested in testing CVS HEAD's handling of EXISTS,
I've about finished what I wanted to do with it.

regards, tom lane

#21Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#20)
#22Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#20)
#23daveg
daveg@sonic.net
In reply to: Simon Riggs (#19)
#24Asko Oja
ascoja@gmail.com
In reply to: daveg (#23)
#25Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#20)