Overhead of union versus union all

Started by Tim Keittalmost 17 years ago19 messagesgeneral
Jump to latest
#1Tim Keitt
tkeitt@keittlab.org

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.) (cc me
please; not subscribed...)

THK

--
Timothy H. Keitt
http://www.keittlab.org/

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tim Keitt (#1)
Re: Overhead of union versus union all

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#3Adam Rich
adam.r@sbcglobal.net
In reply to: Tim Keitt (#1)
Re: Overhead of union versus union all

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.) (cc me
please; not subscribed...)

THK

I think you can test this one yourself pretty easily. Just run the two
queries with "explain analyze". Union All should run in about the sum
of the separate queries. Plain Union will always be slower, because it
takes the same results from "union all" and runs them through an extra
sort/distinct or hash step. In my tests, on a query with 600,000 rows,
the Plain Union took about 3x as long to complete.

#4Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#2)
Re: Overhead of union versus union all

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

Yep, ideally UNION ALL would be the default behavior, but that standard
requires otherwise. Many people don't know that UNION has an extra
SORT/UNIQUE step.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#5Scott Bailey
artacus@comcast.net
In reply to: Alvaro Herrera (#2)
Re: Overhead of union versus union all

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

#6Bruce Momjian
bruce@momjian.us
In reply to: Scott Bailey (#5)
Re: Overhead of union versus union all

Scott Bailey wrote:

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort. I assume sort
is still used if the hash exceeds workmem size.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#7Scott Marlowe
scott.marlowe@gmail.com
In reply to: Bruce Momjian (#6)
Re: Overhead of union versus union all

On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:

Scott Bailey wrote:

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step.  Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort.  I assume sort
is still used if the hash exceeds workmem size.

The important point being that it's still more expensive than a plain
union all thought, right?

#8Bruce Momjian
bruce@momjian.us
In reply to: Scott Marlowe (#7)
Re: Overhead of union versus union all

Scott Marlowe wrote:

On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:

Scott Bailey wrote:

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. ?Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort. ?I assume sort
is still used if the hash exceeds workmem size.

The important point being that it's still more expensive than a plain
union all thought, right?

Yep.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Scott Marlowe (#7)
Re: Overhead of union versus union all

On Thu, 2009-07-09 at 20:41 -0600, Scott Marlowe wrote:

On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:

Scott Bailey wrote:

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort. I assume sort
is still used if the hash exceeds workmem size.

The important point being that it's still more expensive than a plain
union all thought, right?

I think it should be possible to use predtest theorem proving to discard
the sort/hash step in cases where we can prove the sets are disjoint.
Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#10Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#9)
Re: Overhead of union versus union all

Simon Riggs wrote:

On Thu, 2009-07-09 at 20:41 -0600, Scott Marlowe wrote:

On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:

Scott Bailey wrote:

Alvaro Herrera wrote:

Tim Keitt wrote:

I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step. Using UNION ALL is recommended
wherever possible.

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort. I assume sort
is still used if the hash exceeds workmem size.

The important point being that it's still more expensive than a plain
union all thought, right?

I think it should be possible to use predtest theorem proving to discard
the sort/hash step in cases where we can prove the sets are disjoint.
Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

I assume we would still need the distinct removal step; we just avoid
the sort/hash.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#10)
Re: Overhead of union versus union all

On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:

I think it should be possible to use predtest theorem proving to

discard

the sort/hash step in cases where we can prove the sets are

disjoint.

Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

I assume we would still need the distinct removal step; we just avoid
the sort/hash.

I mean it seems possible to prove that the distinct removal step is not
necessary, by proving that the various sub-queries are already disjoint.
It's a common manual optimization, so automating it seems a reasonable
future goal.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#12Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#11)
Re: Overhead of union versus union all

Simon Riggs wrote:

On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:

I think it should be possible to use predtest theorem proving to

discard

the sort/hash step in cases where we can prove the sets are

disjoint.

Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

I assume we would still need the distinct removal step; we just avoid
the sort/hash.

I mean it seems possible to prove that the distinct removal step is not
necessary, by proving that the various sub-queries are already disjoint.
It's a common manual optimization, so automating it seems a reasonable
future goal.

I am confused what sub-queries produce _distinct_ output. I know there
are some that produce _ordered_ output.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#12)
Re: Overhead of union versus union all

On Fri, 2009-07-10 at 09:28 -0400, Bruce Momjian wrote:

Simon Riggs wrote:

On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:

I think it should be possible to use predtest theorem proving to

discard

the sort/hash step in cases where we can prove the sets are

disjoint.

Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

I assume we would still need the distinct removal step; we just avoid
the sort/hash.

I mean it seems possible to prove that the distinct removal step is not
necessary, by proving that the various sub-queries are already disjoint.
It's a common manual optimization, so automating it seems a reasonable
future goal.

I am confused what sub-queries produce _distinct_ output. I know there
are some that produce _ordered_ output.

None, that was not my point.

If you have a query like this

Select ..., status, ...
...
where status = '1'
union
Select ..., status, ...
...
where status = '2';

or a query like this

Select '1', ...
...
union
Select '2', ...
...
;

or a query like this

Select '1', ...
...
union
Select status, ...
...
where status != '1';
;

then it is clear that we could automatically prove that the the distinct
step is redundant and so we could either hash or sort. This is the same
as replacing the UNION with UNION ALL.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#14Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#13)
Re: Overhead of union versus union all

Simon Riggs wrote:

or a query like this

Select '1', ...
...
union
Select status, ...
...
where status != '1';
;

then it is clear that we could automatically prove that the the distinct
step is redundant and so we could either hash or sort. This is the same
as replacing the UNION with UNION ALL.

In the last example, how do you know that status != '1' produces unique
output? I assumed UNION gave distinct for the entire output, not just
remove duplicates from the two UNION branches; that's how Postgres
behaves now:

test=> SELECT 1 UNION (SELECT 2 UNION ALL SELECT 2);
?column?
----------
1
2
(2 rows)

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#15Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#14)
Re: Overhead of union versus union all

On Fri, 2009-07-10 at 09:46 -0400, Bruce Momjian wrote:

Simon Riggs wrote:

or a query like this

Select '1', ...
...
union
Select status, ...
...
where status != '1';
;

then it is clear that we could automatically prove that the the distinct
step is redundant and so we could either hash or sort. This is the same
as replacing the UNION with UNION ALL.

In the last example, how do you know that status != '1' produces unique
output?

You don't. I was assuming that you could already prove that each
subquery was distinct in itself.

It's one for the TODO, that's all. I see it often, but I'm not planning
to work on the code for this myself.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#16Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#11)
Re: Overhead of union versus union all

On Fri, 2009-07-10 at 14:22 +0100, Simon Riggs wrote:

I mean it seems possible to prove that the distinct removal step is not
necessary, by proving that the various sub-queries are already disjoint.
It's a common manual optimization, so automating it seems a reasonable
future goal.

There are even simpler cases that postgresql can't optimize. Consider:

-- foo has a primary key
SELECT * FROM foo UNION SELECT * FROM foo;

That's logically equivalent to:

SELECT * FROM foo;

But postgresql will add a sort anyway.

There are lots of optimizations along these lines. They seem obscure,
but these optimizations become much more useful when using views or
complex queries where the same table appears multiple times. For
instance, if you have two views that are projections of the same table,
then, you join the views together, you can optimize away the join in
some cases, and just scan the original table.

I think a lot of these optimizations depend on knowing which tables (or
subqueries) are relations in the relational theory sense; i.e.
unordered, distinct, and have no NULLs in the relevant attributes.

Regards,
Jeff Davis

#17Bruce Momjian
bruce@momjian.us
In reply to: Jeff Davis (#16)
Re: Overhead of union versus union all

On Fri, Jul 10, 2009 at 6:37 PM, Jeff Davis<pgsql@j-davis.com> wrote:

-- foo has a primary key
SELECT * FROM foo UNION SELECT * FROM foo;

That's logically equivalent to:

SELECT * FROM foo;

But postgresql will add a sort anyway.

Well no, it's equivalent to SELECT DISTINCT * FROM foo;

--
greg
http://mit.edu/~gsstark/resume.pdf

#18Scott Marlowe
scott.marlowe@gmail.com
In reply to: Bruce Momjian (#17)
Re: Overhead of union versus union all

On Fri, Jul 10, 2009 at 11:47 AM, Greg Stark<gsstark@mit.edu> wrote:

On Fri, Jul 10, 2009 at 6:37 PM, Jeff Davis<pgsql@j-davis.com> wrote:

-- foo has a primary key
SELECT * FROM foo UNION SELECT * FROM foo;

That's logically equivalent to:

SELECT * FROM foo;

But postgresql will add a sort anyway.

Well no, it's equivalent to SELECT DISTINCT * FROM foo;

And honestly, I'd rather see development effort go into making complex
queries run faster (the things like bitmap indexes on disk etc) rather
than optimising things that i can optimise myself.

#19Jeff Davis
pgsql@j-davis.com
In reply to: Bruce Momjian (#17)
Re: Overhead of union versus union all

On Fri, 2009-07-10 at 18:47 +0100, Greg Stark wrote:

-- foo has a primary key

Well no, it's equivalent to SELECT DISTINCT * FROM foo;

I think you missed that "foo" has a primary key.

Regards,
Jeff Davis