No hash join across partitioned tables?
PG (8.3.7) doesn't seem to want to do a hash join across two partitioned
tables. I have two partition hierarchies: impounds (with different
impound sources) and liens (with vehicle liens from different companies).
Trying to match those up gives:
EXPLAIN SELECT COUNT(*)
FROM impounds i
JOIN liens l ON (i.vin = l.vin);
Aggregate (cost=11164042.66..11164042.67 rows=1 width=0)
-> Nested Loop (cost=0.27..3420012.94 rows=3097611886 width=0)
Join Filter: ((i.vin)::text = (l.vin)::text)
-> Append (cost=0.00..1072.77 rows=33577 width=21)
-> Seq Scan on impounds i (cost=0.00..11.40 rows=140 width=21)
-> Seq Scan on impounds_s1 i (cost=0.00..926.87 rows=29587 width=18)
-> Seq Scan on impounds_s2 i (cost=0.00..99.96 rows=3296 width=18)
-> Seq Scan on impounds_s3 i (cost=0.00..23.14 rows=414 width=18)
-> Seq Scan on impounds_s4 i (cost=0.00..11.40 rows=140 width=21)
-> Append (cost=0.27..101.64 rows=15 width=21)
-> Bitmap Heap Scan on liens l (cost=0.27..5.60 rows=2 width=21)
Recheck Cond: ((l.vin)::text = (i.vin)::text)
-> Bitmap Index Scan on liens_pk (cost=0.00..0.27 rows=2 width=0)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using liens_s1_pk on liens_s1 l (cost=0.00..7.02 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using liens_s2_pk on liens_s2 l (cost=0.00..3.47 rows=1 width=21)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using newliens_s3_pk on liens_s3 l (cost=0.00..7.52 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using newliens_s4_pk on liens_s4 l (cost=0.00..7.67 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using newliens_s5_pk on liens_s5 l (cost=0.00..7.62 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using newliens_s6_pk on liens_s6 l (cost=0.00..7.61 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using newliens_s7_pk on liens_s7 l (cost=0.00..7.50 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using newliens_s8_pk on liens_s8 l (cost=0.00..7.36 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using newliens_s9_pk on liens_s9 l (cost=0.00..7.43 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using newliens_s10_pk on liens_s10 l (cost=0.00..7.79 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using newliens_s11_pk on liens_s11 l (cost=0.00..8.07 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using newliens_s12_pk on liens_s12 l (cost=0.00..8.45 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
-> Index Scan using newliens_s13_pk on liens_s13 l (cost=0.00..8.53 rows=1 width=18)
Index Cond: ((l.vin)::text = (i.vin)::text)
This takes quite a while, as it has to do tons of index probes, which
results in tons of random I/O. I killed it after five minutes of
running.
But if I do:
CREATE TABLE i1 AS SELECT * FROM impounds;
CREATE TABLE l1 AS SELECT * FROM liens;
I get a reasonable plan, which runs in about 15 seconds, from:
EXPLAIN SELECT COUNT(*)
FROM i1 i
JOIN l1 l ON (i.vin = l.vin);
Aggregate (cost=749054.78..749054.79 rows=1 width=0)
-> Hash Join (cost=1444.18..748971.43 rows=33338 width=0)
Hash Cond: ((l.vin)::text = (i.vin)::text)
-> Seq Scan on l1 l (cost=0.00..332068.96 rows=18449996 width=18)
-> Hash (cost=1027.97..1027.97 rows=33297 width=18)
-> Seq Scan on i1 i (cost=0.00..1027.97 rows=33297 width=18)
I've tried to force the hash join plan on the partitioned tables via:
set enable_nestloop to off;
This results in a merge join plan which needs to do a giant sort, again
killed after five minutes.
Aggregate (cost=58285765.20..58285765.21 rows=1 width=0)
-> Merge Join (cost=4077389.31..50541735.48 rows=3097611886 width=0)
Merge Cond: ((i.vin)::text = (l.vin)::text)
-> Sort (cost=4286.45..4370.39 rows=33577 width=21)
Sort Key: i.vin
-> Append (cost=0.00..1072.77 rows=33577 width=21)
-> Seq Scan on impounds i (cost=0.00..11.40 rows=140 width=21)
-> [Seq Scans on other partitions]
-> Materialize (cost=4073102.86..4303737.81 rows=18450796 width=21)
-> Sort (cost=4073102.86..4119229.85 rows=18450796 width=21)
Sort Key: l.vin
-> Append (cost=0.00..332797.96 rows=18450796 width=21)
-> Seq Scan on liens l (cost=0.00..14.00 rows=400 width=21)
-> [Seq Scans on other partitions]
Disabling mergejoin pushes it back to a nestloop join. Why can't it hash
join these two together?
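The whole progression can be reproduced with the planner GUCs (a sketch, using this thread's table names). Note that disabling mergejoin on top of nestloop still does not yield the hash join; the nestloop path, even with the disable penalty added to its cost, apparently still comes out cheaper than the hash join estimate:

```sql
-- Disabling nestloop produces the merge join plan shown above;
-- additionally disabling mergejoin pushes the planner back to the
-- (penalized) nestloop rather than the hash join.
SET enable_nestloop TO off;
SET enable_mergejoin TO off;
EXPLAIN SELECT COUNT(*)
FROM impounds i
JOIN liens l ON (i.vin = l.vin);
```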
Kris Jurka
Kris Jurka <books@ejurka.com> writes:
PG (8.3.7) doesn't seem to want to do a hash join across two partitioned
tables.
Could we see the whole declaration of these tables? (pg_dump -s output
would be convenient)
regards, tom lane
On Thu, 16 Apr 2009, Tom Lane wrote:
Kris Jurka <books@ejurka.com> writes:
PG (8.3.7) doesn't seem to want to do a hash join across two partitioned
tables.
Could we see the whole declaration of these tables? (pg_dump -s output
would be convenient)
The attached table definition with no data wants to mergejoin first, but
after disabling mergejoin it does indeed do a hashjoin.
Looking back at the cost estimates for the merge and nestloop joins, the
planner seems to be estimating the join size as the cartesian product * .005,
while the actual number of output rows in this case is 2437 (cartesian
product * 4e-9). Perhaps the cost estimates for the real data are so high
because of this bogus row count that the fudge factor used to disable
mergejoin isn't enough?
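As a quick sanity check of that arithmetic (row counts taken from the Append nodes in the plans above), the nestloop plan's join estimate matches the cartesian product times the default equality selectivity of 0.005:

```sql
-- 33577 (impounds Append) * 18450796 (liens Append) * 0.005
-- reproduces the planner's 3097611886-row join estimate:
SELECT round(33577::numeric * 18450796 * 0.005);  -- 3097611886
```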
Kris Jurka
Attachments:
hash-join-partition.sql (text/plain; charset=US-ASCII)
On Thu, 16 Apr 2009, Kris Jurka wrote:
Perhaps the cost estimates for the real data are so high because of this
bogus row count that the fudge factor to disable mergejoin isn't enough?
Indeed, I get these cost estimates on 8.4b1 with an increased
disable_cost value:
nestloop: 11171206.18
merge: 58377401.39
hash: 116763544.76
So the default disable_cost isn't enough to push it to use the hash join
plan, and it goes back to nestloop. Since disable_cost hasn't been touched
since January 2000, perhaps it's time to bump that up to match today's
hardware and problem sizes? This isn't even a particularly big problem;
it's joining 18M rows against 30k.
The real problem is getting reasonable stats to pass through the partition
Append step, so it can make a reasonable estimate of the join output size.
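One way to see the missing statistics (a sketch, assuming the schema from this thread): on 8.3, pg_stats only carries per-table entries, so there is no "parent plus all children" row for the planner to estimate the Append output from:

```sql
-- Any rows returned here describe only tuples stored in the listed
-- table itself, not the whole inheritance tree:
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename IN ('impounds', 'liens') AND attname = 'vin';
```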
Kris Jurka
Kris Jurka <books@ejurka.com> writes:
So the default disable_cost isn't enough to push it to use the hash join
plan and goes back to nestloop. Since disable_cost hasn't been touched
since January 2000, perhaps it's time to bump that up to match today's
hardware and problem sizes?
I think disable_cost was originally set at a time when costs were
integers :-(. Yeah, there's probably no reason not to throw another
zero or two on it.
Is there another issue here besides that one? I think you were hoping
that the hash join would be faster than the alternatives, but the cost
estimate says it's a lot slower. Is that actually the case?
regards, tom lane
Tom Lane wrote:
Is there another issue here besides that one? I think you were hoping
that the hash join would be faster than the alternatives, but the cost
estimate says it's a lot slower. Is that actually the case?
The hash join takes less than twenty seconds, the other two joins I
killed after five minutes. I can try to collect explain analyze results
later today if you'd like.
Kris Jurka
Kris Jurka <books@ejurka.com> writes:
The hash join takes less than twenty seconds, the other two joins I
killed after five minutes. I can try to collect explain analyze results
later today if you'd like.
Please, unless the test case you already posted has similar behavior.
regards, tom lane
Tom Lane wrote:
Kris Jurka <books@ejurka.com> writes:
The hash join takes less than twenty seconds, the other two joins I
killed after five minutes. I can try to collect explain analyze results
later today if you'd like.
Attached are the explain analyze results. The analyze part hits the
hash join worst of all, so I've also included the timings without analyzing.
Method     Time (ms)   Time w/Analyze (ms)
nestloop   304853      319060
merge      514517      683757
hash       18957       143731
Kris Jurka
Kris Jurka <books@ejurka.com> writes:
The real problem is getting reasonable stats to pass through the partition
Append step, so it can make a reasonable estimate of the join output size.
I dug around a bit and concluded that the lack of stats for the Append
relation is indeed the main problem. It's not so much the bad join size
estimate (although that could hurt for cases where you need to join this
result to another table). Rather, it's that the planner is deliberately
biased against picking hash joins in the absence of stats for the inner
relation. Per the comments for estimate_hash_bucketsize:
* If no statistics are available, use a default estimate of 0.1. This will
* discourage use of a hash rather strongly if the inner relation is large,
* which is what we want. We do not want to hash unless we know that the
* inner rel is well-dispersed (or the alternatives seem much worse).
While we could back off the default a bit here, I think it'd be better
to fix it by not punting on the stats-for-append-relations problem.
That doesn't seem like material for 8.4 at this point, though.
regards, tom lane
Did this get addressed?
---------------------------------------------------------------------------
Tom Lane wrote:
Kris Jurka <books@ejurka.com> writes:
The real problem is getting reasonable stats to pass through the partition
Append step, so it can make a reasonable estimate of the join output size.
I dug around a bit and concluded that the lack of stats for the Append
relation is indeed the main problem. It's not so much the bad join size
estimate (although that could hurt for cases where you need to join this
result to another table). Rather, it's that the planner is deliberately
biased against picking hash joins in the absence of stats for the inner
relation. Per the comments for estimate_hash_bucketsize:
* If no statistics are available, use a default estimate of 0.1. This will
* discourage use of a hash rather strongly if the inner relation is large,
* which is what we want. We do not want to hash unless we know that the
* inner rel is well-dispersed (or the alternatives seem much worse).
While we could back off the default a bit here, I think it'd be better
to fix it by not punting on the stats-for-append-relations problem.
That doesn't seem like material for 8.4 at this point, though.
regards, tom lane
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
Did this get addressed?
Partially. There are stats now but autovacuum is not bright about
when to update them.
regards, tom lane
On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Bruce Momjian <bruce@momjian.us> writes:
Did this get addressed?
Partially. There are stats now but autovacuum is not bright about
when to update them.
Is that something you're planning to fix for 9.0? If not, we at least
need to document what we intend for people to do about it.
...Robert
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Partially. There are stats now but autovacuum is not bright about
when to update them.
Is that something you're planning to fix for 9.0? If not, we at least
need to document what we intend for people to do about it.
I want to look at it, but I'm not sure whether the fix will be small
enough that we want to put it in during beta.
regards, tom lane
On Tue, Mar 2, 2010 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Partially. There are stats now but autovacuum is not bright about
when to update them.
Is that something you're planning to fix for 9.0? If not, we at least
need to document what we intend for people to do about it.
I want to look at it, but I'm not sure whether the fix will be small
enough that we want to put it in during beta.
I am pretty sure many people will appreciate it, even if it isn't going to
be small.
Is that stat collection across child tables useful by itself?
--
GJ
On Tue, Mar 2, 2010 at 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Partially. There are stats now but autovacuum is not bright about
when to update them.
Is that something you're planning to fix for 9.0? If not, we at least
need to document what we intend for people to do about it.
I want to look at it, but I'm not sure whether the fix will be small
enough that we want to put it in during beta.
In going back through emails I had marked as possibly needing another
look before 9.0 is released, I came across this issue again. As I
understand it, analyze (or analyse) now collects statistics for both
the parent individually, and for the parent and its children together.
However, as I further understand it, autovacuum won't actually fire
off an analyze unless there's enough activity on the parent table
considered individually to warrant it. So if you have an empty parent
and a bunch of children with data in them, your stats will still stink,
unless you analyze by hand.
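Until that is automated, the hand-run workaround is a sketch like the following (table names taken from this thread); on 9.0, analyzing the parent also populates the inherited "parent plus children" statistics that autovacuum won't refresh on its own when the parent itself sees little activity:

```sql
-- Refresh statistics for the parents of both partition hierarchies;
-- schedule this periodically (e.g. via cron) since autovacuum keys off
-- activity on the parent table considered individually:
ANALYZE impounds;
ANALYZE liens;
```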
Assuming my understanding of the problem is correct, we could:
(a) fix it,
(b) document that you should consider periodic manual analyze commands
in this situation, or
(c) do nothing.
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes:
In going back through emails I had marked as possibly needing another
look before 9.0 is released, I came across this issue again. As I
understand it, analyze (or analyse) now collects statistics for both
the parent individually, and for the parent and its children together.
However, as I further understand it, autovacuum won't actually fire
off an analyze unless there's enough activity on the parent table
considered individually to warrant it. So if you have an empty parent
and a bunch of children with data in them, your stats will still stink,
unless you analyze by hand.
Check.
Assuming my understanding of the problem is correct, we could:
(a) fix it,
(b) document that you should consider periodic manual analyze commands
in this situation, or
(c) do nothing.
Thoughts?
The objections to (a) are that it might result in excessive ANALYZE work
if not done intelligently, and that we haven't got a patch ready anyway.
I would have liked to get to this for 9.0 but I feel it's a bit late
now.
regards, tom lane
(moving to -hackers)
On Wed, Jun 9, 2010 at 4:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
In going back through emails I had marked as possibly needing another
look before 9.0 is released, I came across this issue again. As I
understand it, analyze (or analyse) now collects statistics for both
the parent individually, and for the parent and its children together.
However, as I further understand it, autovacuum won't actually fire
off an analyze unless there's enough activity on the parent table
considered individually to warrant it. So if you have an empty parent
and a bunch of children with data in them, your stats will still stink,
unless you analyze by hand.
Check.
Assuming my understanding of the problem is correct, we could:
(a) fix it,
(b) document that you should consider periodic manual analyze commands
in this situation, or
(c) do nothing.
Thoughts?
The objections to (a) are that it might result in excessive ANALYZE work
if not done intelligently, and that we haven't got a patch ready anyway.
I would have liked to get to this for 9.0 but I feel it's a bit late
now.
I guess I can't really disagree with that. Should we try to document
this in some way?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Greetings all,
I have been trying to create/run a build farm as part of a project I am
working on. However, I have noticed the primary git repostitory,
git.postgresql.org/git, does not seem to be working. Namely, whenever I
try to clone the directory, I receive this error:
Error: Unable to find 5e4933c31d3cd2750ee1793efe6eca43055fb273e under
http://git.postgresql.org/git/postgresql.git
Cannot obtain needed blob 5e4933c31d3cd2750ee1793efe6eca4305fb273e while
processing commit c5609c66ce2ee4fdb180be95721252b47f90499
Error: fetch failed.
I thought it would be prudent to notify the list so someone could
possibly check into this.
Thanks!
Scott Luxenberg
Luxenberg, Scott I. wrote:
Greetings all,
I have been trying to create/run a build farm as part of a project I am
working on.
That seems an odd thing to do since we have one ...
However, I have noticed the primary git repository,
git.postgresql.org/git, does not seem to be working. Namely, whenever I
try to clone the directory, I receive this error:
Error: Unable to find 5e4933c31d3cd2750ee1793efe6eca43055fb273e under
http://git.postgresql.org/git/postgresql.git
Cannot obtain needed blob 5e4933c31d3cd2750ee1793efe6eca4305fb273e while
processing commit c5609c66ce2ee4fdb180be95721252b47f90499
Error: fetch failed.
I thought it would be prudent to notify the list so someone could
possibly check into this.
Why are you cloning over http? Here is the best way to clone, which
seems to be working:
[andrew@sophia ]$ git clone --mirror
git://git.postgresql.org/git/postgresql.git
Initialized empty Git repository in /home/andrew/postgresql.git/
remote: Counting objects: 376865, done.
remote: Compressing objects: 100% (87569/87569), done.
remote: Total 376865 (delta 310187), reused 352950 (delta 287485)
Receiving objects: 100% (376865/376865), 178.73 MiB | 251 KiB/s, done.
Resolving deltas: 100% (310187/310187), done.
[andrew@sophia ]$
cheers
andrew
* Andrew Dunstan (andrew@dunslane.net) wrote:
Luxenberg, Scott I. wrote:
I have been trying to create/run a build farm as part of a project I am
working on.
That seems an odd thing to do since we have one ...
To clarify, he's setting up a build farm *member*. :)
However, I have noticed the primary git repository,
git.postgresql.org/git, does not seem to be working. Namely, whenever I
try to clone the directory, I receive this error:
Error: Unable to find 5e4933c31d3cd2750ee1793efe6eca43055fb273e under
http://git.postgresql.org/git/postgresql.git
Cannot obtain needed blob 5e4933c31d3cd2750ee1793efe6eca4305fb273e while
processing commit c5609c66ce2ee4fdb180be95721252b47f90499
Error: fetch failed.
I thought it would be prudent to notify the list so someone could
possibly check into this.
Why are you cloning over http? Here is the best way to clone, which
seems to be working:
Unfortunately for us, the port that git uses isn't currently allowed
outbound by our corporate firewall. I expect that to be true for other
PG users who want git and for some build-farm members, so I think we
really need to support git cloning over http.
As a side-note, it works just fine from GitHub's http mirror and that's
what we've been playing with, but I don't know if we want to recommend
that for build-farm members..
Thanks!
Stephen