No hash join across partitioned tables?

Started by Kris Jurka · April 2009 · 43 messages · hackers
#1 Kris Jurka
books@ejurka.com

PG (8.3.7) doesn't seem to want to do a hash join across two partitioned
tables. I have two partition hierarchies: impounds (with different
impound sources) and liens (with vehicle liens from different companies).
Trying to match those up gives:

EXPLAIN SELECT COUNT(*)
FROM impounds i
JOIN liens l ON (i.vin = l.vin);

Aggregate  (cost=11164042.66..11164042.67 rows=1 width=0)
  ->  Nested Loop  (cost=0.27..3420012.94 rows=3097611886 width=0)
        Join Filter: ((i.vin)::text = (l.vin)::text)
        ->  Append  (cost=0.00..1072.77 rows=33577 width=21)
              ->  Seq Scan on impounds i  (cost=0.00..11.40 rows=140 width=21)
              ->  Seq Scan on impounds_s1 i  (cost=0.00..926.87 rows=29587 width=18)
              ->  Seq Scan on impounds_s2 i  (cost=0.00..99.96 rows=3296 width=18)
              ->  Seq Scan on impounds_s3 i  (cost=0.00..23.14 rows=414 width=18)
              ->  Seq Scan on impounds_s4 i  (cost=0.00..11.40 rows=140 width=21)
        ->  Append  (cost=0.27..101.64 rows=15 width=21)
              ->  Bitmap Heap Scan on liens l  (cost=0.27..5.60 rows=2 width=21)
                    Recheck Cond: ((l.vin)::text = (i.vin)::text)
                    ->  Bitmap Index Scan on liens_pk  (cost=0.00..0.27 rows=2 width=0)
                          Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using liens_s1_pk on liens_s1 l  (cost=0.00..7.02 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using liens_s2_pk on liens_s2 l  (cost=0.00..3.47 rows=1 width=21)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using newliens_s3_pk on liens_s3 l  (cost=0.00..7.52 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using newliens_s4_pk on liens_s4 l  (cost=0.00..7.67 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using newliens_s5_pk on liens_s5 l  (cost=0.00..7.62 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using newliens_s6_pk on liens_s6 l  (cost=0.00..7.61 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using newliens_s7_pk on liens_s7 l  (cost=0.00..7.50 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using newliens_s8_pk on liens_s8 l  (cost=0.00..7.36 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using newliens_s9_pk on liens_s9 l  (cost=0.00..7.43 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using newliens_s10_pk on liens_s10 l  (cost=0.00..7.79 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using newliens_s11_pk on liens_s11 l  (cost=0.00..8.07 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using newliens_s12_pk on liens_s12 l  (cost=0.00..8.45 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)
              ->  Index Scan using newliens_s13_pk on liens_s13 l  (cost=0.00..8.53 rows=1 width=18)
                    Index Cond: ((l.vin)::text = (i.vin)::text)

This takes quite a while, as it's got to do tons of index probes, which
results in tons of random I/O. I killed this after five minutes of
running.

But if I do:

CREATE TABLE i1 AS SELECT * FROM impounds;
CREATE TABLE l1 AS SELECT * FROM liens;

I get a reasonable plan, which runs in about 15 seconds, from:

EXPLAIN SELECT COUNT(*)
FROM i1 i
JOIN l1 l ON (i.vin = l.vin);

Aggregate  (cost=749054.78..749054.79 rows=1 width=0)
  ->  Hash Join  (cost=1444.18..748971.43 rows=33338 width=0)
        Hash Cond: ((l.vin)::text = (i.vin)::text)
        ->  Seq Scan on l1 l  (cost=0.00..332068.96 rows=18449996 width=18)
        ->  Hash  (cost=1027.97..1027.97 rows=33297 width=18)
              ->  Seq Scan on i1 i  (cost=0.00..1027.97 rows=33297 width=18)

I've tried to force the hash join plan on the partitioned tables via:

set enable_nestloop to off;

This results in a merge join plan which needs to do a giant sort, again
killed after five minutes.

Aggregate  (cost=58285765.20..58285765.21 rows=1 width=0)
  ->  Merge Join  (cost=4077389.31..50541735.48 rows=3097611886 width=0)
        Merge Cond: ((i.vin)::text = (l.vin)::text)
        ->  Sort  (cost=4286.45..4370.39 rows=33577 width=21)
              Sort Key: i.vin
              ->  Append  (cost=0.00..1072.77 rows=33577 width=21)
                    ->  Seq Scan on impounds i  (cost=0.00..11.40 rows=140 width=21)
                    ->  [Seq Scans on other partitions]
        ->  Materialize  (cost=4073102.86..4303737.81 rows=18450796 width=21)
              ->  Sort  (cost=4073102.86..4119229.85 rows=18450796 width=21)
                    Sort Key: l.vin
                    ->  Append  (cost=0.00..332797.96 rows=18450796 width=21)
                          ->  Seq Scan on liens l  (cost=0.00..14.00 rows=400 width=21)
                          ->  [Seq Scans on other partitions]

Disabling mergejoin pushes it back to a nestloop join. Why can't it hash
join these two together?

Kris Jurka
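The planner-method settings tried above are ordinary session GUCs; for reference, a minimal sketch of steering this particular query (table names as in the message above):

```sql
-- Disable the two join methods the planner keeps choosing; the settings
-- are session-local and can be reset afterwards.  As the message above
-- notes, even with both of these off the planner here fell back to a
-- nested loop rather than producing a hash join.
SET enable_nestloop = off;
SET enable_mergejoin = off;

EXPLAIN SELECT COUNT(*)
FROM impounds i
JOIN liens l ON (i.vin = l.vin);

RESET enable_nestloop;
RESET enable_mergejoin;
```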

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kris Jurka (#1)
Re: No hash join across partitioned tables?

Kris Jurka <books@ejurka.com> writes:

PG (8.3.7) doesn't seem to want to do a hash join across two partitioned
tables.

Could we see the whole declaration of these tables? (pg_dump -s output
would be convenient)

regards, tom lane

#3 Kris Jurka
books@ejurka.com
In reply to: Tom Lane (#2)
Re: No hash join across partitioned tables?

On Thu, 16 Apr 2009, Tom Lane wrote:

Kris Jurka <books@ejurka.com> writes:

PG (8.3.7) doesn't seem to want to do a hash join across two partitioned
tables.

Could we see the whole declaration of these tables? (pg_dump -s output
would be convenient)

The attached table definition with no data wants to mergejoin first, but
after disabling mergejoin it does indeed do a hashjoin.

Looking back at the cost estimates for the merge and nestloop joins, it
seems to be estimating the join size as the number of rows in the
cartesian product * .005, while the actual number of output rows in this
case is 2437 (cartesian product * 4e-9). Perhaps the cost estimates for
the real data are so high because of this bogus row count that the fudge
factor to disable mergejoin isn't enough?

Kris Jurka

Attachments:

hash-join-partition.sql (text/plain; charset=US-ASCII)
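The estimate Kris describes can be checked against the row counts in the first message's plan: with no statistics for the Append relations, the planner falls back to its default equality-join selectivity of 0.005 (DEFAULT_EQ_SEL in the planner source), and applying that to the cartesian product reproduces the nested loop's row estimate exactly:

```sql
-- 33577 rows (impounds Append) * 18450796 rows (liens Append) * 0.005,
-- the default selectivity for an equality join without usable stats:
SELECT (33577::bigint * 18450796 * 0.005)::bigint AS estimated_join_rows;
-- = 3097611886, matching rows=3097611886 on the Nested Loop node
```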
#4 Kris Jurka
books@ejurka.com
In reply to: Kris Jurka (#3)
Re: No hash join across partitioned tables?

On Thu, 16 Apr 2009, Kris Jurka wrote:

Perhaps the cost estimates for the real data are so high because of this
bogus row count that the fudge factor to disable mergejoin isn't enough?

Indeed, I get these cost estimates on 8.4b1 with an increased
disable_cost value:

nestloop: 11171206.18
merge: 58377401.39
hash: 116763544.76

So the default disable_cost isn't enough to push it to use the hash join
plan and goes back to nestloop. Since disable_cost hasn't been touched
since January 2000, perhaps it's time to bump that up to match today's
hardware and problem sizes? This isn't even a particularly big problem,
it's joining 18M rows against 30k.

The real problem is getting reasonable stats to pass through the partition
Append step, so it can make a reasonable estimate of the join output size.

Kris Jurka
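The arithmetic behind that conclusion, using the costs above and the disable_cost constant (1.0e8 in those releases, defined in costsize.c), can be spelled out:

```sql
-- Adding disable_cost to both "disabled" join paths still leaves the
-- nested loop cheaper than the hash join, so the planner picks it anyway:
SELECT 11171206.18 + 1.0e8 AS nestloop_disabled,   -- 111171206.18
       58377401.39 + 1.0e8 AS merge_disabled,      -- 158377401.39
       116763544.76        AS hash_join;           -- loses by ~5.6M
```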

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kris Jurka (#4)
Re: No hash join across partitioned tables?

Kris Jurka <books@ejurka.com> writes:

So the default disable_cost isn't enough to push it to use the hash join
plan and goes back to nestloop. Since disable_cost hasn't been touched
since January 2000, perhaps it's time to bump that up to match today's
hardware and problem sizes?

I think disable_cost was originally set at a time when costs were
integers :-(. Yeah, there's probably no reason not to throw another
zero or two on it.

Is there another issue here besides that one? I think you were hoping
that the hash join would be faster than the alternatives, but the cost
estimate says it's a lot slower. Is that actually the case?

regards, tom lane

#6 Kris Jurka
books@ejurka.com
In reply to: Tom Lane (#5)
Re: No hash join across partitioned tables?

Tom Lane wrote:

Is there another issue here besides that one? I think you were hoping
that the hash join would be faster than the alternatives, but the cost
estimate says it's a lot slower. Is that actually the case?

The hash join takes less than twenty seconds, the other two joins I
killed after five minutes. I can try to collect explain analyze results
later today if you'd like.

Kris Jurka

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kris Jurka (#6)
Re: No hash join across partitioned tables?

Kris Jurka <books@ejurka.com> writes:

The hash join takes less than twenty seconds, the other two joins I
killed after five minutes. I can try to collect explain analyze results
later today if you'd like.

Please, unless the test case you already posted has similar behavior.

regards, tom lane

#8 Kris Jurka
books@ejurka.com
In reply to: Tom Lane (#7)
Re: No hash join across partitioned tables?

Tom Lane wrote:

Kris Jurka <books@ejurka.com> writes:

The hash join takes less than twenty seconds, the other two joins I
killed after five minutes. I can try to collect explain analyze results
later today if you'd like.

Attached are the explain analyze results. The analyze part hits the
hash join worst of all, so I've also included the timings without analyzing.

Method      Time (ms)    Time w/Analyze (ms)
nestloop       304853                 319060
merge          514517                 683757
hash            18957                 143731

Kris Jurka

Attachments:

hash.txt (text/plain)
merge.txt (text/plain)
nestloop.txt (text/plain)
#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kris Jurka (#4)
Re: No hash join across partitioned tables?

Kris Jurka <books@ejurka.com> writes:

The real problem is getting reasonable stats to pass through the partition
Append step, so it can make a reasonable estimate of the join output size.

I dug around a bit and concluded that the lack of stats for the Append
relation is indeed the main problem. It's not so much the bad join size
estimate (although that could hurt for cases where you need to join this
result to another table). Rather, it's that the planner is deliberately
biased against picking hash joins in the absence of stats for the inner
relation. Per the comments for estimate_hash_bucketsize:

* If no statistics are available, use a default estimate of 0.1. This will
* discourage use of a hash rather strongly if the inner relation is large,
* which is what we want. We do not want to hash unless we know that the
* inner rel is well-dispersed (or the alternatives seem much worse).

While we could back off the default a bit here, I think it'd be better
to fix it by not punting on the stats-for-append-relations problem.
That doesn't seem like material for 8.4 at this point, though.

regards, tom lane

#10 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#9)
Re: No hash join across partitioned tables?

Did this get addressed?

---------------------------------------------------------------------------

Tom Lane wrote:

Kris Jurka <books@ejurka.com> writes:

The real problem is getting reasonable stats to pass through the partition
Append step, so it can make a reasonable estimate of the join output size.

I dug around a bit and concluded that the lack of stats for the Append
relation is indeed the main problem. It's not so much the bad join size
estimate (although that could hurt for cases where you need to join this
result to another table). Rather, it's that the planner is deliberately
biased against picking hash joins in the absence of stats for the inner
relation. Per the comments for estimate_hash_bucketsize:

* If no statistics are available, use a default estimate of 0.1. This will
* discourage use of a hash rather strongly if the inner relation is large,
* which is what we want. We do not want to hash unless we know that the
* inner rel is well-dispersed (or the alternatives seem much worse).

While we could back off the default a bit here, I think it'd be better
to fix it by not punting on the stats-for-append-relations problem.
That doesn't seem like material for 8.4 at this point, though.

regards, tom lane

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#10)
Re: No hash join across partitioned tables?

Bruce Momjian <bruce@momjian.us> writes:

Did this get addressed?

Partially. There are stats now but autovacuum is not bright about
when to update them.

regards, tom lane

#12 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#11)
Re: No hash join across partitioned tables?

On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bruce Momjian <bruce@momjian.us> writes:

Did this get addressed?

Partially.  There are stats now but autovacuum is not bright about
when to update them.

Is that something you're planning to fix for 9.0? If not, we at least
need to document what we intend for people to do about it.

...Robert

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#12)
Re: No hash join across partitioned tables?

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Partially. There are stats now but autovacuum is not bright about
when to update them.

Is that something you're planning to fix for 9.0? If not, we at least
need to document what we intend for people to do about it.

I want to look at it, but I'm not sure whether the fix will be small
enough that we want to put it in during beta.

regards, tom lane

#14 Grzegorz Jaśkiewicz
gryzman@gmail.com
In reply to: Tom Lane (#13)
Re: No hash join across partitioned tables?

On Tue, Mar 2, 2010 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Partially. There are stats now but autovacuum is not bright about
when to update them.

Is that something you're planning to fix for 9.0? If not, we at least
need to document what we intend for people to do about it.

I want to look at it, but I'm not sure whether the fix will be small
enough that we want to put it in during beta.

I am pretty sure many people will appreciate it, even if it isn't going
to be small.

Is that stat collection across child tables useful by itself?

--
GJ

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: No hash join across partitioned tables?

On Tue, Mar 2, 2010 at 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Partially.  There are stats now but autovacuum is not bright about
when to update them.

Is that something you're planning to fix for 9.0?  If not, we at least
need to document what we intend for people to do about it.

I want to look at it, but I'm not sure whether the fix will be small
enough that we want to put it in during beta.

In going back through emails I had marked as possibly needing another
look before 9.0 is released, I came across this issue again. As I
understand it, analyze (or analyse) now collects statistics for both
the parent individually, and for the parent and its children together.
However, as I further understand it, autovacuum won't actually fire
off an analyze unless there's enough activity on the parent table
considered individually to warrant it. So if you have an empty parent
and a bunch of children with data in it, your stats will still stink,
unless you analyze by hand.

Assuming my understanding of the problem is correct, we could:

(a) fix it,
(b) document that you should consider periodic manual analyze commands
in this situation, or
(c) do nothing.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
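For anyone hitting this in practice, option (b) amounts to the following sketch; under the 9.0-era behavior described earlier in the thread, analyzing a parent collects both the parent-only and the parent-plus-children statistics, but autovacuum won't trigger it for an empty parent, so it has to be scheduled by hand (table names from the original report):

```sql
-- Manual workaround: analyze each partition parent periodically
-- (e.g. from cron) so the whole-hierarchy statistics stay current.
ANALYZE impounds;
ANALYZE liens;
```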

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#15)
Re: No hash join across partitioned tables?

Robert Haas <robertmhaas@gmail.com> writes:

In going back through emails I had marked as possibly needing another
look before 9.0 is released, I came across this issue again. As I
understand it, analyze (or analyse) now collects statistics for both
the parent individually, and for the parent and its children together.
However, as I further understand it, autovacuum won't actually fire
off an analyze unless there's enough activity on the parent table
considered individually to warrant it. So if you have an empty parent
and a bunch of children with data in it, your stats will still stink,
unless you analyze by hand.

Check.

Assuming my understanding of the problem is correct, we could:

(a) fix it,
(b) document that you should consider periodic manual analyze commands
in this situation, or
(c) do nothing.

Thoughts?

The objections to (a) are that it might result in excessive ANALYZE work
if not done intelligently, and that we haven't got a patch ready anyway.
I would have liked to get to this for 9.0 but I feel it's a bit late
now.

regards, tom lane

#17 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#16)
Re: [PERFORM] No hash join across partitioned tables?

(moving to -hackers)

On Wed, Jun 9, 2010 at 4:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

In going back through emails I had marked as possibly needing another
look before 9.0 is released, I came across this issue again.  As I
understand it, analyze (or analyse) now collects statistics for both
the parent individually, and for the parent and its children together.
 However, as I further understand it, autovacuum won't actually fire
off an analyze unless there's enough activity on the parent table
considered individually to warrant it.  So if you have an empty parent
and a bunch of children with data in it, your stats will still stink,
unless you analyze by hand.

Check.

Assuming my understanding of the problem is correct, we could:

(a) fix it,
(b) document that you should consider periodic manual analyze commands
in this situation, or
(c) do nothing.

Thoughts?

The objections to (a) are that it might result in excessive ANALYZE work
if not done intelligently, and that we haven't got a patch ready anyway.
I would have liked to get to this for 9.0 but I feel it's a bit late
now.

I guess I can't really disagree with that. Should we try to document
this in some way?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#18 Luxenberg, Scott I.
Scott.Luxenberg@noblis.org
In reply to: Robert Haas (#17)
Error with GIT Repository

Greetings all,

I have been trying to create/run a build farm as part of a project I am
working on. However, I have noticed the primary git repository,
git.postgresql.org/git, does not seem to be working. Namely, whenever I
try to clone the directory, I receive this error:

Error: Unable to find 5e4933c31d3cd2750ee1793efe6eca43055fb273e under
http://git.postgresql.org/git/postgresql.git
Cannot obtain needed blob 5e4933c31d3cd2750ee1793efe6eca4305fb273e while
processing commit c5609c66ce2ee4fdb180be95721252b47f90499
Error: fetch failed.

I thought it would be prudent to notify the list so someone could
possibly check into this.

Thanks!

Scott Luxenberg

#19 Andrew Dunstan
andrew@dunslane.net
In reply to: Luxenberg, Scott I. (#18)
Re: Error with GIT Repository

Luxenberg, Scott I. wrote:

Greetings all,

I have been trying to create/run a build farm as part of a project I am
working on.

That seems an odd thing to do since we have one ...

However, I have noticed the primary git repository,
git.postgresql.org/git, does not seem to be working. Namely, whenever I
try to clone the directory, I receive this error:

Error: Unable to find 5e4933c31d3cd2750ee1793efe6eca43055fb273e under
http://git.postgresql.org/git/postgresql.git
Cannot obtain needed blob 5e4933c31d3cd2750ee1793efe6eca4305fb273e while
processing commit c5609c66ce2ee4fdb180be95721252b47f90499
Error: fetch failed.

I thought it would be prudent to notify the list so someone could
possibly check into this.

Why are you cloning over http? Here is the best way to clone, which
seems to be working:

[andrew@sophia ]$ git clone --mirror
git://git.postgresql.org/git/postgresql.git
Initialized empty Git repository in /home/andrew/postgresql.git/
remote: Counting objects: 376865, done.
remote: Compressing objects: 100% (87569/87569), done.
remote: Total 376865 (delta 310187), reused 352950 (delta 287485)
Receiving objects: 100% (376865/376865), 178.73 MiB | 251 KiB/s, done.
Resolving deltas: 100% (310187/310187), done.
[andrew@sophia ]$

cheers

andrew

#20 Stephen Frost
sfrost@snowman.net
In reply to: Andrew Dunstan (#19)
Re: Error with GIT Repository

* Andrew Dunstan (andrew@dunslane.net) wrote:

Luxenberg, Scott I. wrote:

I have been trying to create/run a build farm as part of a project I am
working on.

That seems an odd thing to do since we have one ...

To clarify, he's setting up a build farm *member*. :)

However, I have noticed the primary git repository,
git.postgresql.org/git, does not seem to be working. Namely, whenever I
try to clone the directory, I receive this error:

Error: Unable to find 5e4933c31d3cd2750ee1793efe6eca43055fb273e under
http://git.postgresql.org/git/postgresql.git
Cannot obtain needed blob 5e4933c31d3cd2750ee1793efe6eca4305fb273e while
processing commit c5609c66ce2ee4fdb180be95721252b47f90499
Error: fetch failed.

I thought it would be prudent to notify the list so someone could
possibly check into this.

Why are you cloning over http? Here is the best way to clone, which
seems to be working:

Unfortunately for us, the port that git uses isn't currently allowed
outbound by our corporate firewall. I expect that to be true for other
PG users who want git and for some build-farm members, so I think we
really need to support git cloning over http.

As a side-note, it works just fine from git-hub's http mirror and that's
what we've been playing with, but I don't know if we want to recommend
that for build-farm members..

Thanks!

Stephen

#21 Andrew Dunstan
andrew@dunslane.net
In reply to: Stephen Frost (#20)
#22 Stephen Frost
sfrost@snowman.net
In reply to: Andrew Dunstan (#21)
#23 Magnus Hagander
magnus@hagander.net
In reply to: Stephen Frost (#22)
#24 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#19)
#25 Leonardo Francalanci
m_lists@yahoo.it
In reply to: Andrew Dunstan (#19)
#26 Andres Freund
andres@anarazel.de
In reply to: Magnus Hagander (#23)
#27 Magnus Hagander
magnus@hagander.net
In reply to: Andres Freund (#26)
#28 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#17)
#29 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#28)
#30 Daniel Farina
drfarina@acm.org
In reply to: Magnus Hagander (#27)
#31 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#16)
#32 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#31)
#33 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#32)
#34 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#33)
#35 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#34)
#36 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#35)
#37 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#15)
#38 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#37)
#39 Sam Gendler
sgendler@ideasculptor.com
In reply to: Tom Lane (#38)
#40 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Sam Gendler (#39)
#41 Sam Gendler
sgendler@ideasculptor.com
In reply to: Alvaro Herrera (#40)
#42 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Sam Gendler (#41)
#43 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#38)