Oops, I seem to have changed UNION's behavior

Started by Tom Lanealmost 27 years ago10 messages
#1Tom Lane
tgl@sss.pgh.pa.us

The equal() updates I installed yesterday (to fix the "don't know
whether nodes of type 600 are equal" problem) have had an unintended
side effect.

Am I right in thinking that UNION (without ALL) is defined to do a
DISTINCT on its result, so that duplicates are removed even if the
duplicates both came from the same source table? That's what 6.4.2
does, but I do not know if it's strictly kosher according to the SQL
spec.

If so, the code is now busted, because with the equal() extension in
place, cnfify() is able to recognize and remove duplicate select
clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to
just "SELECT xxx" ... and that doesn't mean the same thing.

An actual example: given the data

play=> select a from tt;
a
-
1
1
2
3
(4 rows)

Under 6.4.2 I get:

play=> select a from tt union select a from tt;
a
-
1
2
3
(3 rows)

Note lack of duplicate "1". Under current sources I get:

ttest=> select a from tt union select a from tt;
a
-
1
1
2
3
(4 rows)

since the query is effectively reduced to just "select a from tt".

Assuming that 6.4.2 is doing the Right Thing, I see two possible fixes:
(1) simplify equal() to say that two T_Query nodes are never equal, or
(2) modify the planner so that the "select distinct" operation is
inserted explicitly, and will thus happen even if the UNIONed selects
are collapsed into just one.

(1) is a trivial fix of course, but it worries me --- maybe someday
we will need equal() to give an honest answer for Query nodes.
But I don't have the expertise to apply (2), and it seems like rather
a lot of work for a boundary case that isn't really interesting in
practice.

Comments? *Is* 6.4.2 behaving according to the SQL spec?

regards, tom lane

#2Oleg Broytmann
phd@sun.med.ru
In reply to: Tom Lane (#1)
Re: [HACKERS] Oops, I seem to have changed UNION's behavior

Hello!

On Sun, 7 Feb 1999, Tom Lane wrote:

The equal() updates I installed yesterday (to fix the "don't know
whether nodes of type 600 are equal" problem) have had an unintended
side effect.

Am I right in thinking that UNION (without ALL) is defined to do a
DISTINCT on its result, so that duplicates are removed even if the
duplicates both came from the same source table? That's what 6.4.2
does, but I do not know if it's strictly kosher according to the SQL
spec.

Yes, this is standard. My books (primary, Gruber) say UNION should work
this way - UNION without ALL implies DISTINCT.

If so, the code is now busted, because with the equal() extension in
place, cnfify() is able to recognize and remove duplicate select
clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to
just "SELECT xxx" ... and that doesn't mean the same thing.

An actual example: given the data

play=> select a from tt;
a
-
1
1
2
3
(4 rows)

Under 6.4.2 I get:

play=> select a from tt union select a from tt;
a
-
1
2
3
(3 rows)

Note lack of duplicate "1". Under current sources I get:

ttest=> select a from tt union select a from tt;
a
-
1
1
2
3
(4 rows)

since the query is effectively reduced to just "select a from tt".

I am sure my books did not consider such case as UNION that could be
otimized this way. Not sure what is Right Thing here...

Oleg.
----
Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/
Programmers don't die, they just GOSUB without RETURN.

#3Thomas G. Lockhart
lockhart@alumni.caltech.edu
In reply to: Tom Lane (#1)
Re: [HACKERS] Oops, I seem to have changed UNION's behavior

The equal() updates I installed yesterday (to fix the "don't know
whether nodes of type 600 are equal" problem) have had an unintended
side effect.
Am I right in thinking that UNION (without ALL) is defined to do a
DISTINCT on its result, so that duplicates are removed even if the
duplicates both came from the same source table? That's what 6.4.2
does, but I do not know if it's strictly kosher according to the SQL
spec.

Yup, that's the way it should be...

If so, the code is now busted, because with the equal() extension in
place, cnfify() is able to recognize and remove duplicate select
clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to
just "SELECT xxx" ... and that doesn't mean the same thing.

:(

Assuming that 6.4.2 is doing the Right Thing, I see two possible
fixes:
(1) simplify equal() to say that two T_Query nodes are never equal, or
(2) modify the planner so that the "select distinct" operation is
inserted explicitly, and will thus happen even if the UNIONed selects
are collapsed into just one.

Sounds to me like (2) is the correct way to do it, as long as the
explicit "sort/unique" happens only if the query is collapsed. I haven't
looked at this code, and have no experience with cnfify(), but at the
time it decides to collapse into a single select, can't it ensure that a
sort/unique is done instead? Or is that what you are suggesting?
Wouldn't want to have two sort/unique operations on top of each other...

- Tom

#4Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#1)
Re: [HACKERS] Oops, I seem to have changed UNION's behavior

Was there a conclusion on this?

The equal() updates I installed yesterday (to fix the "don't know
whether nodes of type 600 are equal" problem) have had an unintended
side effect.

Am I right in thinking that UNION (without ALL) is defined to do a
DISTINCT on its result, so that duplicates are removed even if the
duplicates both came from the same source table? That's what 6.4.2
does, but I do not know if it's strictly kosher according to the SQL
spec.

If so, the code is now busted, because with the equal() extension in
place, cnfify() is able to recognize and remove duplicate select
clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to
just "SELECT xxx" ... and that doesn't mean the same thing.

An actual example: given the data

play=> select a from tt;
a
-
1
1
2
3
(4 rows)

Under 6.4.2 I get:

play=> select a from tt union select a from tt;
a
-
1
2
3
(3 rows)

Note lack of duplicate "1". Under current sources I get:

ttest=> select a from tt union select a from tt;
a
-
1
1
2
3
(4 rows)

since the query is effectively reduced to just "select a from tt".

Assuming that 6.4.2 is doing the Right Thing, I see two possible fixes:
(1) simplify equal() to say that two T_Query nodes are never equal, or
(2) modify the planner so that the "select distinct" operation is
inserted explicitly, and will thus happen even if the UNIONed selects
are collapsed into just one.

(1) is a trivial fix of course, but it worries me --- maybe someday
we will need equal() to give an honest answer for Query nodes.
But I don't have the expertise to apply (2), and it seems like rather
a lot of work for a boundary case that isn't really interesting in
practice.

Comments? *Is* 6.4.2 behaving according to the SQL spec?

regards, tom lane

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#5Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Bruce Momjian (#4)
Re: [HACKERS] Oops, I seem to have changed UNION's behavior

Am I right in thinking that UNION (without ALL) is defined to do a
DISTINCT on its result, so that duplicates are removed even if the
duplicates both came from the same source table? That's what 6.4.2
does, but I do not know if it's strictly kosher according to the SQL
spec.

(Just in case this is still active)

Yes, this is the right behavior according to SQL92...

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Lockhart (#5)
Re: [HACKERS] Oops, I seem to have changed UNION's behavior

Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

Am I right in thinking that UNION (without ALL) is defined to do a
DISTINCT on its result, so that duplicates are removed even if the
duplicates both came from the same source table? That's what 6.4.2
does, but I do not know if it's strictly kosher according to the SQL
spec.

(Just in case this is still active)

Yes, this is the right behavior according to SQL92...

OK, then 6.5 is still broken :-(. I know a lot more about the planner
than I did then, so I will see if I can fix it "right" --- that is,
without taking out equal()'s ability to detect equality of Query nodes.

If that seems too hard/risky, I will just lobotomize equal() instead.

Thanks for the reminder, Bruce --- I had forgotten about this issue.

regards, tom lane

#7Taral
taral@taral.net
In reply to: Thomas Lockhart (#5)
Re: [HACKERS] Oops, I seem to have changed UNION's behavior

On Sun, 9 May 1999, Thomas Lockhart wrote:

Am I right in thinking that UNION (without ALL) is defined to do a
DISTINCT on its result, so that duplicates are removed even if the
duplicates both came from the same source table? That's what 6.4.2
does, but I do not know if it's strictly kosher according to the SQL
spec.

Yes, this is the right behavior according to SQL92...

In which case something should put a DISTINCT on queries using UNION...
since making T_Query nodes never equal is a deoptimization.

Taral

#8Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#6)
Re: [HACKERS] Oops, I seem to have changed UNION's behavior

Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

Am I right in thinking that UNION (without ALL) is defined to do a
DISTINCT on its result, so that duplicates are removed even if the
duplicates both came from the same source table? That's what 6.4.2
does, but I do not know if it's strictly kosher according to the SQL
spec.

(Just in case this is still active)

Yes, this is the right behavior according to SQL92...

OK, then 6.5 is still broken :-(. I know a lot more about the planner
than I did then, so I will see if I can fix it "right" --- that is,
without taking out equal()'s ability to detect equality of Query nodes.

If that seems too hard/risky, I will just lobotomize equal() instead.

Thanks for the reminder, Bruce --- I had forgotten about this issue.

Hey, that's why I keep 500 messages in my PostgreSQL mailbox.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#9Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#1)
Re: [HACKERS] Oops, I seem to have changed UNION's behavior

Can someone comment on this? Is it still an issue with cnf'ify removing
duplicate cases?

The equal() updates I installed yesterday (to fix the "don't know
whether nodes of type 600 are equal" problem) have had an unintended
side effect.

Am I right in thinking that UNION (without ALL) is defined to do a
DISTINCT on its result, so that duplicates are removed even if the
duplicates both came from the same source table? That's what 6.4.2
does, but I do not know if it's strictly kosher according to the SQL
spec.

If so, the code is now busted, because with the equal() extension in
place, cnfify() is able to recognize and remove duplicate select
clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to
just "SELECT xxx" ... and that doesn't mean the same thing.

An actual example: given the data

play=> select a from tt;
a
-
1
1
2
3
(4 rows)

Under 6.4.2 I get:

play=> select a from tt union select a from tt;
a
-
1
2
3
(3 rows)

Note lack of duplicate "1". Under current sources I get:

ttest=> select a from tt union select a from tt;
a
-
1
1
2
3
(4 rows)

since the query is effectively reduced to just "select a from tt".

Assuming that 6.4.2 is doing the Right Thing, I see two possible fixes:
(1) simplify equal() to say that two T_Query nodes are never equal, or
(2) modify the planner so that the "select distinct" operation is
inserted explicitly, and will thus happen even if the UNIONed selects
are collapsed into just one.

(1) is a trivial fix of course, but it worries me --- maybe someday
we will need equal() to give an honest answer for Query nodes.
But I don't have the expertise to apply (2), and it seems like rather
a lot of work for a boundary case that isn't really interesting in
practice.

Comments? *Is* 6.4.2 behaving according to the SQL spec?

regards, tom lane

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#9)
Re: [HACKERS] Oops, I seem to have changed UNION's behavior

Bruce Momjian <maillist@candle.pha.pa.us> writes:

Can someone comment on this? Is it still an issue with cnf'ify removing
duplicate cases?

This is still an open bug. I looked at it and concluded that a clean
solution probably requires modifying the parsetree structure for UNIONs.
(Currently, UNION, INTERSECT, EXCEPT are converted to OR, AND, AND NOT
structures with queries as arguments ... which would be OK, except that
the semantics aren't really the same, and the difference between UNION
and UNION ALL doesn't fit into that operator set at all.)

That looked like a lot of work for what is surely a low-priority bug,
so I haven't gotten around to it yet. Need more round tuits.

In the meantime, it should be on the TODO list:
* SELECT foo UNION SELECT foo is incorrectly simplified to SELECT foo

regards, tom lane

Show quoted text

The equal() updates I installed yesterday (to fix the "don't know
whether nodes of type 600 are equal" problem) have had an unintended
side effect.

Am I right in thinking that UNION (without ALL) is defined to do a
DISTINCT on its result, so that duplicates are removed even if the
duplicates both came from the same source table? That's what 6.4.2
does, but I do not know if it's strictly kosher according to the SQL
spec.

If so, the code is now busted, because with the equal() extension in
place, cnfify() is able to recognize and remove duplicate select
clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to
just "SELECT xxx" ... and that doesn't mean the same thing.

An actual example: given the data

play=> select a from tt;
a
-
1
1
2
3
(4 rows)

Under 6.4.2 I get:

play=> select a from tt union select a from tt;
a
-
1
2
3
(3 rows)

Note lack of duplicate "1". Under current sources I get:

ttest=> select a from tt union select a from tt;
a
-
1
1
2
3
(4 rows)

since the query is effectively reduced to just "select a from tt".

Assuming that 6.4.2 is doing the Right Thing, I see two possible fixes:
(1) simplify equal() to say that two T_Query nodes are never equal, or
(2) modify the planner so that the "select distinct" operation is
inserted explicitly, and will thus happen even if the UNIONed selects
are collapsed into just one.

(1) is a trivial fix of course, but it worries me --- maybe someday
we will need equal() to give an honest answer for Query nodes.
But I don't have the expertise to apply (2), and it seems like rather
a lot of work for a boundary case that isn't really interesting in
practice.

Comments? *Is* 6.4.2 behaving according to the SQL spec?

regards, tom lane

-- 
Bruce Momjian                        |  http://www.op.net/~candle
maillist@candle.pha.pa.us            |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026