Overhead of union versus union all

Started by Tim Keittabout 17 years ago19 messagesgeneral

tkeitt@keittlab.org

about 17 years ago

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.) (cc me
please; not subscribed...)

THK

--
Timothy H. Keitt
http://www.keittlab.org/

Alvaro Herrera

alvherre@2ndquadrant.com

about 17 years ago

In reply to: Tim Keitt (#1)

Re: Overhead of union versus union all

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Adam Rich

adam.r@sbcglobal.net

about 17 years ago

In reply to: Tim Keitt (#1)

Re: Overhead of union versus union all

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.) (cc me
please; not subscribed...)

THK

I think you can test this one yourself pretty easily. Just run the two
queries with "explain analyze". Union All should run in about the sum
of the separate queries. Plain Union will always be slower, because it
takes the same results from "union all" and runs them through an extra
sort/distinct or hash step. In my tests, on a query with 600,000 rows,
the Plain Union took about 3x as long to complete.

Bruce Momjian

bruce@momjian.us

about 17 years ago

In reply to: Alvaro Herrera (#2)

Re: Overhead of union versus union all

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

Yep, ideally UNION ALL would be the default behavior, but that standard
requires otherwise. Many people don't know that UNION has an extra
SORT/UNIQUE step.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Scott Bailey

artacus@comcast.net

about 17 years ago

In reply to: Alvaro Herrera (#2)

Re: Overhead of union versus union all

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Bruce Momjian

bruce@momjian.us

about 17 years ago

In reply to: Scott Bailey (#5)

Re: Overhead of union versus union all

Scott Bailey wrote:

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort. I assume sort
is still used if the hash exceeds workmem size.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Scott Marlowe

scott.marlowe@gmail.com

about 17 years ago

In reply to: Bruce Momjian (#6)

Re: Overhead of union versus union all

On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:

Scott Bailey wrote:

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort. I assume sort
is still used if the hash exceeds workmem size.

The important point being that it's still more expensive than a plain
union all thought, right?

Bruce Momjian

bruce@momjian.us

about 17 years ago

In reply to: Scott Marlowe (#7)

Re: Overhead of union versus union all

Scott Marlowe wrote:

On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:

Scott Bailey wrote:

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. ?Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort. ?I assume sort
is still used if the hash exceeds workmem size.

The important point being that it's still more expensive than a plain
union all thought, right?

Yep.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Scott Marlowe (#7)

Re: Overhead of union versus union all

On Thu, 2009-07-09 at 20:41 -0600, Scott Marlowe wrote:

On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:

Scott Bailey wrote:

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort. I assume sort
is still used if the hash exceeds workmem size.

The important point being that it's still more expensive than a plain
union all thought, right?

I think it should be possible to use predtest theorem proving to discard
the sort/hash step in cases where we can prove the sets are disjoint.
Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#10

Bruce Momjian

bruce@momjian.us

about 17 years ago

In reply to: Simon Riggs (#9)

Re: Overhead of union versus union all

Simon Riggs wrote:

On Thu, 2009-07-09 at 20:41 -0600, Scott Marlowe wrote:

On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:

Scott Bailey wrote:

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort. I assume sort
is still used if the hash exceeds workmem size.

The important point being that it's still more expensive than a plain
union all thought, right?

I think it should be possible to use predtest theorem proving to discard
the sort/hash step in cases where we can prove the sets are disjoint.
Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

I assume we would still need the distinct removal step; we just avoid
the sort/hash.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#11

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Bruce Momjian (#10)

Re: Overhead of union versus union all

On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:

I think it should be possible to use predtest theorem proving to

discard

the sort/hash step in cases where we can prove the sets are

disjoint.

Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

I assume we would still need the distinct removal step; we just avoid
the sort/hash.

I mean it seems possible to prove that the distinct removal step is not
necessary, by proving that the various sub-queries are already disjoint.
It's a common manual optimization, so automating it seems a reasonable
future goal.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#12

Bruce Momjian

bruce@momjian.us

about 17 years ago

In reply to: Simon Riggs (#11)

Re: Overhead of union versus union all

Simon Riggs wrote:

On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:

I think it should be possible to use predtest theorem proving to

discard

the sort/hash step in cases where we can prove the sets are

disjoint.

Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

I assume we would still need the distinct removal step; we just avoid
the sort/hash.

I mean it seems possible to prove that the distinct removal step is not
necessary, by proving that the various sub-queries are already disjoint.
It's a common manual optimization, so automating it seems a reasonable
future goal.

I am confused what sub-queries produce _distinct_ output. I know there
are some that produce _ordered_ output.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#13

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Bruce Momjian (#12)

Re: Overhead of union versus union all

On Fri, 2009-07-10 at 09:28 -0400, Bruce Momjian wrote:

Simon Riggs wrote:

On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:

I think it should be possible to use predtest theorem proving to

discard

the sort/hash step in cases where we can prove the sets are

disjoint.

Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

I assume we would still need the distinct removal step; we just avoid
the sort/hash.

I mean it seems possible to prove that the distinct removal step is not
necessary, by proving that the various sub-queries are already disjoint.
It's a common manual optimization, so automating it seems a reasonable
future goal.

I am confused what sub-queries produce _distinct_ output. I know there
are some that produce _ordered_ output.

None, that was not my point.

If you have a query like this

Select ..., status, ...
...
where status = '1'
union
Select ..., status, ...
...
where status = '2';

or a query like this

Select '1', ...
...
union
Select '2', ...
...
;

or a query like this

Select '1', ...
...
union
Select status, ...
...
where status != '1';
;

then it is clear that we could automatically prove that the the distinct
step is redundant and so we could either hash or sort. This is the same
as replacing the UNION with UNION ALL.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#14

Bruce Momjian

bruce@momjian.us

about 17 years ago

In reply to: Simon Riggs (#13)

Re: Overhead of union versus union all

Simon Riggs wrote:

or a query like this

Select '1', ...
...
union
Select status, ...
...
where status != '1';
;

then it is clear that we could automatically prove that the the distinct
step is redundant and so we could either hash or sort. This is the same
as replacing the UNION with UNION ALL.

In the last example, how do you know that status != '1' produces unique
output? I assumed UNION gave distinct for the entire output, not just
remove duplicates from the two UNION branches; that's how Postgres
behaves now:

test=> SELECT 1 UNION (SELECT 2 UNION ALL SELECT 2);
?column?
----------
1
2
(2 rows)

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#15

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Bruce Momjian (#14)

Re: Overhead of union versus union all

On Fri, 2009-07-10 at 09:46 -0400, Bruce Momjian wrote:

Simon Riggs wrote:

or a query like this

Select '1', ...
...
union
Select status, ...
...
where status != '1';
;

then it is clear that we could automatically prove that the the distinct
step is redundant and so we could either hash or sort. This is the same
as replacing the UNION with UNION ALL.

In the last example, how do you know that status != '1' produces unique
output?

You don't. I was assuming that you could already prove that each
subquery was distinct in itself.

It's one for the TODO, that's all. I see it often, but I'm not planning
to work on the code for this myself.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#16

Jeff Davis

pgsql@j-davis.com

about 17 years ago

In reply to: Simon Riggs (#11)

Re: Overhead of union versus union all

On Fri, 2009-07-10 at 14:22 +0100, Simon Riggs wrote:

I mean it seems possible to prove that the distinct removal step is not
necessary, by proving that the various sub-queries are already disjoint.
It's a common manual optimization, so automating it seems a reasonable
future goal.

There are even simpler cases that postgresql can't optimize. Consider:

-- foo has a primary key
SELECT * FROM foo UNION SELECT * FROM foo;

That's logically equivalent to:

SELECT * FROM foo;

But postgresql will add a sort anyway.

There are lots of optimizations along these lines. They seem obscure,
but these optimizations become much more useful when using views or
complex queries where the same table appears multiple times. For
instance, if you have two views that are projections of the same table,
then, you join the views together, you can optimize away the join in
some cases, and just scan the original table.

I think a lot of these optimizations depend on knowing which tables (or
subqueries) are relations in the relational theory sense; i.e.
unordered, distinct, and have no NULLs in the relevant attributes.

Regards,
Jeff Davis

#17

Bruce Momjian

bruce@momjian.us

about 17 years ago

In reply to: Jeff Davis (#16)

Re: Overhead of union versus union all

On Fri, Jul 10, 2009 at 6:37 PM, Jeff Davis<pgsql@j-davis.com> wrote:

-- foo has a primary key
SELECT * FROM foo UNION SELECT * FROM foo;

That's logically equivalent to:

SELECT * FROM foo;

But postgresql will add a sort anyway.

Well no, it's equivalent to SELECT DISTINCT * FROM foo;

--
greg
http://mit.edu/~gsstark/resume.pdf

#18

Scott Marlowe

scott.marlowe@gmail.com

about 17 years ago

In reply to: Bruce Momjian (#17)

Re: Overhead of union versus union all

On Fri, Jul 10, 2009 at 11:47 AM, Greg Stark<gsstark@mit.edu> wrote:

On Fri, Jul 10, 2009 at 6:37 PM, Jeff Davis<pgsql@j-davis.com> wrote:

-- foo has a primary key
SELECT * FROM foo UNION SELECT * FROM foo;

That's logically equivalent to:

SELECT * FROM foo;

But postgresql will add a sort anyway.

Well no, it's equivalent to SELECT DISTINCT * FROM foo;

And honestly, I'd rather see development effort go into making complex
queries run faster (the things like bitmap indexes on disk etc) rather
than optimising things that i can optimise myself.

#19

Jeff Davis

pgsql@j-davis.com

about 17 years ago

In reply to: Bruce Momjian (#17)

Re: Overhead of union versus union all

On Fri, 2009-07-10 at 18:47 +0100, Greg Stark wrote:

-- foo has a primary key

Well no, it's equivalent to SELECT DISTINCT * FROM foo;

I think you missed that "foo" has a primary key.

Regards,
Jeff Davis