Potential Join Performance Issue
PostgreSQL development community:
Our research group has been using the PostgreSQL code base to test new
join algorithms. During testing, we noticed that the planner is not
pushing down projections to the outer relation in a hash join. Although
this makes sense for in-memory (1 batch) joins, for joins larger than
memory (such as for TPC-H DSS), this causes the system to perform
significantly more disk I/Os when reading/writing batches of the outer
relation.
A simple solution is to add a single line of code to
src/backend/optimizer/plan/createplan.c after line 1771:
disuse_physical_tlist(outer_plan, best_path->jpath.outerjoinpath);
This will always force the projection on the outer relation.
A more complicated alternative is to add a state variable to allow the
planner to know how many batches the hash join expects and only push
down the projection if it is greater than one. However, pushing the
projection on the outer relation is almost always the best choice as it
eliminates unneeded attributes for operators above the hash join in the
plan and will be robust in the case of poor estimates.
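To make that alternative concrete, here is a rough, self-contained
sketch of the decision logic (illustrative C only, not PostgreSQL
source; the batch-count estimate mimics the spirit of
ExecChooseHashTableSize(), and the work_mem value and tuple counts are
invented examples):

#include <stdio.h>

/* Stand-in for ExecChooseHashTableSize(): the expected batch count is
 * roughly inner-relation bytes over available hash memory, rounded up
 * to a power of two as PostgreSQL does. */
static int
expected_batches(double inner_tuples, int tuple_width, double work_mem_bytes)
{
    double inner_bytes = inner_tuples * tuple_width;
    int    nbatch = 1;

    while (nbatch * work_mem_bytes < inner_bytes)
        nbatch *= 2;
    return nbatch;
}

int
main(void)
{
    double work_mem = 1024.0 * 1024.0;      /* 1 MB budget (example) */
    int    nbatch = expected_batches(1500000.0, 48, work_mem);

    /* The proposed rule: project the outer relation early only when
     * batching means its tuples will be written to temp files. */
    printf("expected batches: %d -> %s outer projection\n",
           nbatch, nbatch > 1 ? "force" : "skip");
    return 0;
}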
We have been testing with TPC-H at scale factor 1 (1 GB). A sample query
that demonstrates the behavior is:
SELECT c_custkey, c_name, o_orderkey, o_orderdate
FROM Customer, Orders
WHERE c_custkey = o_custkey
Note that EXPLAIN on this query will indicate that the projection is
performed on the outer relation even though it is not actually done. We
found the difference by modifying our code to track tuples and bytes
written to disk, but it can also be detected by watching the size of the
temporary files produced during the join.
Sincerely,
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
http://people.ok.ubc.ca/rlawrenc/
E-mail: ramon.lawrence@ubc.ca
"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes:
> Our research group has been using the PostgreSQL code base to test new
> join algorithms. During testing, we noticed that the planner is not
> pushing down projections to the outer relation in a hash join. Although
> this makes sense for in-memory (1 batch) joins, for joins larger than
> memory (such as for TPC-H DSS), this causes the system to perform
> significantly more disk I/Os when reading/writing batches of the outer
> relation.
Hm. The proposed patch seems a bit brute-force, since it loses the
benefit of the physical-tlist optimization even if the relations are
certainly too small to require batching.
> A more complicated alternative is to add a state variable to allow the
> planner to know how many batches the hash join expects and only push
> down the projection if it is greater than one. However, pushing the
> projection on the outer relation is almost always the best choice as it
> eliminates unneeded attributes for operators above the hash join in the
> plan and will be robust in the case of poor estimates.
Nonetheless, I'm inclined to do it that way. The "robust in the case of
poor estimates" argument doesn't convince me, because the incremental
cost isn't *that* large if we get it wrong; and the other argument is
just bogus because we don't do physical tlists at or above joins anyhow.
regards, tom lane
On Tue, 2008-09-09 at 11:21 -0700, Lawrence, Ramon wrote:
> Our research group has been using the PostgreSQL code base to test new
> join algorithms.
Sounds cool. I'm sure you'll come up with some good things.
You might be interested in this also
http://archives.postgresql.org/pgsql-hackers/2007-01/msg01600.php
after which Greg Stark and I were investigating using
alternate/compressed data structures to avoid the need to switch to
multi-batch hash joins.
If we knew we were dealing with nearly contiguous ranges of discrete
values, we could store the missing values rather than the present values
using an HRL-encoded bitmap. Other ideas are possible also, I'm sure.
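As a toy illustration of storing the absent values instead of the
present ones (illustrative C only; a real implementation would use a
compressed run-length bitmap such as HRL, and every name below is
invented):

#include <stdio.h>
#include <stdlib.h>

/* For keys known to lie in [lo, hi] with only a few gaps, store the
 * missing keys; membership is a range check plus a binary search. */
typedef struct
{
    int  lo, hi;        /* contiguous key range covered */
    int *missing;       /* sorted array of absent keys */
    int  nmissing;
} GapSet;

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;

    return (x > y) - (x < y);
}

static int
gapset_contains(const GapSet *s, int key)
{
    if (key < s->lo || key > s->hi)
        return 0;
    return bsearch(&key, s->missing, s->nmissing,
                   sizeof(int), cmp_int) == NULL;
}

int
main(void)
{
    int    gaps[] = {1005, 1017};           /* only two keys absent */
    GapSet s = {1000, 2000, gaps, 2};

    printf("%d %d %d\n",
           gapset_contains(&s, 1004),       /* 1: present */
           gapset_contains(&s, 1005),       /* 0: listed as missing */
           gapset_contains(&s, 2001));      /* 0: out of range */
    return 0;
}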
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Attached is a patch that will disable the physical-tlist optimization
for hash join if the number of batches is greater than 1. The patch and
performance results were created by Michael Henderson (graduate
student).
To keep the changes simple, the update simply calls
ExecChooseHashTableSize() in create_hashjoin_plan() to re-calculate the
expected number of batches. This is more efficient and requires fewer
code changes than modifying the HashPath struct to store the number of
batches and updating that variable during costing (cost_hashjoin() will
be called many times while paths are considered).
We have also attached some performance results that show a dramatic
effect when disabling the physical-tlist optimization for joins with
more than one batch.
I do not know the performance tradeoffs of using the physical-tlist
optimization to avoid projection on the outer relation for joins with
one batch. However, there is a potentially huge penalty if the optimizer
is wrong. If the optimizer predicts one batch but, due to poor estimates
or data skew, more than one batch is needed at execution time, then the
join operator will perform considerably more I/O on the outer relation,
which still contains the unnecessary attributes.
An ideal solution would detect at execution time if the inner relation
remained in memory (one batch) and decide to disable/enable the
physical-tlist optimization on the outer relation accordingly. At this
time, we are uncertain if this would be desirable or possible.
Sincerely,
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca
Attachment: createplan_c.diff (application/octet-stream)
Index: createplan.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/optimizer/plan/createplan.c,v
retrieving revision 1.248
diff -r1.248 createplan.c
1651a1652,1660
> #define PATH_ROWS(path) \
> (IsA(path, UniquePath) ? \
> ((UniquePath *) (path))->rows : \
> (path)->parent->rows)
>
> extern void ExecChooseHashTableSize(double ntuples, int tupwidth,
> int *numbuckets,
> int *numbatches);
>
1663a1673,1676
> Path *outer_path = best_path->jpath.outerjoinpath;
> double outer_path_rows = PATH_ROWS(outer_path);
> int nbatch;
> int nbuckets;
1699a1713,1719
> /* We don't want any excess columns in tuples we write to disk */
> ExecChooseHashTableSize(outer_path_rows, outer_path->parent->width, &nbuckets, &nbatch);
> if (nbatch > 1)
> {
> disuse_physical_tlist(outer_plan, outer_path);
> }
>
"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes:
> To keep the changes simple, the update simply calls
> ExecChooseHashTableSize() in create_hashjoin_plan() to re-calculate the
> expected number of batches. This is more efficient and requires fewer
> code changes than modifying the HashPath struct to store the number of
> batches and updating that variable during costing (cost_hashjoin() will
> be called many times while paths are considered).
I was intending to do it the other way, actually. An extra field in
HashPath hardly costs anything. The other reason for it is that there
are other possible uses for knowing whether a hash will be multi-batch.
(For example, if we were prepared to tell the executor that it *must*
keep the hash to one batch, we could assume that the sort order of the
left input is preserved. I haven't looked into the risks/benefits of
that too much, but it's been in the back of the mind for a long time.)
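The shape of that suggestion, in a stripped-down sketch (the structs
here are simplified stand-ins, not the actual PostgreSQL definitions,
and the field name is a guess):

#include <stdio.h>

typedef struct JoinPath
{
    double total_cost;
} JoinPath;

typedef struct HashPath
{
    JoinPath jpath;
    int      num_batches;   /* set once by costing, read at plan time */
} HashPath;

/* Costing records the expected batch count along with the cost... */
static void
cost_hashjoin(HashPath *path, int estimated_batches)
{
    path->jpath.total_cost = 100.0 * estimated_batches;  /* placeholder */
    path->num_batches = estimated_batches;
}

/* ...and plan creation consults it instead of re-deriving it. */
static void
create_hashjoin_plan(const HashPath *best_path)
{
    if (best_path->num_batches > 1)
        printf("disuse physical tlist on outer side\n");
    else
        printf("keep physical tlist\n");
}

int
main(void)
{
    HashPath p;

    cost_hashjoin(&p, 4);
    create_hashjoin_plan(&p);
    return 0;
}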
> An ideal solution would detect at execution time if the inner relation
> remained in memory (one batch) and decide to disable/enable the
> physical-tlist optimization on the outer relation accordingly. At this
> time, we are uncertain if this would be desirable or possible.
That seems pretty infeasible really. Aside from changing plan node
output tuple types on-the-fly, it would mean renumbering Vars in the
join node to reference the outer relation's new output columns. The
overhead of supporting that would be paid across-the-board in the
executor whether or not anyone got any real benefit from it.
I'd be more inclined to deal with the issue by trying to establish a
"safety margin" in the estimate of whether the hash will go multi-batch.
IOW we should disuse_physical_tlist if the hash is estimated to be close
to but still within one batch.
regards, tom lane
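One possible reading of that safety margin, as a sketch (illustrative C
only; the 0.75 threshold is an invented placeholder, not a value from
this thread):

#include <stdio.h>

/* Treat the join as effectively multi-batch, and project the outer side
 * early, whenever the inner relation is estimated to fill most of the
 * single-batch memory budget. */
static int
should_disuse_physical_tlist(double inner_bytes, double hash_mem_bytes)
{
    return inner_bytes > 0.75 * hash_mem_bytes;   /* invented margin */
}

int
main(void)
{
    double hash_mem = 1024.0 * 1024.0;            /* 1 MB budget */

    printf("%d\n", should_disuse_physical_tlist(500000.0, hash_mem)); /* 0 */
    printf("%d\n", should_disuse_physical_tlist(900000.0, hash_mem)); /* 1 */
    return 0;
}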
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> I was intending to do it the other way, actually. An extra field in
> HashPath hardly costs anything. The other reason for it is that there
> are other possible uses for knowing whether a hash will be multi-batch.
> (For example, if we were prepared to tell the executor that it *must*
> keep the hash to one batch, we could assume that the sort order of the
> left input is preserved. I haven't looked into the risks/benefits of
> that too much, but it's been in the back of the mind for a long time.)
Having the number of batches in HashPath could be potentially useful for
a variety of reasons. For our research, we have added an nbatch
variable in both HashPath and HashJoin. Having it in HashJoin is useful
as we modified EXPLAIN to output the number of batches. There are costs
to putting an nbatch variable in HashPath, as the system may set this
variable hundreds or thousands of times during costing and does not
(currently) use it until the chosen HashPath is converted to a plan.
> I'd be more inclined to deal with the issue by trying to establish a
> "safety margin" in the estimate of whether the hash will go multi-batch.
> IOW we should disuse_physical_tlist if the hash is estimated to be close
> to but still within one batch.
Our experiments with large TPC-H 1 GB joins show that it is almost
always better not to use physical_tlists if the number of batches is
> 1. There is a noticeable (approximately 5-15%) improvement when using
physical_tlists for in-memory joins. For two-batch joins, it can
sometimes go either way depending on how many attributes are projected
out of the outer relation. Using physical_tlists may be better even for
two batches if most of the attributes of the outer relation are kept.
For a larger number of batches, the extra I/O cost significantly
outweighs the physical_tlist savings. Performance of multi-batch joins
may improve by 50% or more when the optimization is disabled.
It is possible to create a "safety margin" by having
ExecChooseHashTableSize() return the value
inner_rel_bytes / hash_table_bytes, which represents the fraction of the
available memory that the inner relation is expected to consume, and
then make decisions based on that. However, this is only as good as the
inner relation size estimate, which may be quite inaccurate, especially
for large queries. A more robust solution could compare the "width" of
the path with the "width" of the relation, combined with the number of
batches, to decide whether projecting early would be worthwhile. It may
be best to keep it simple, use number of batches > 1 as the criterion,
and instead focus on examining issues with inaccurate join size
estimates.
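A sketch of that width-aware variant (illustrative C only; the 0.8 ratio
is an invented placeholder):

#include <stdio.h>

/* Project the outer input early only when batching is expected and the
 * projected ("path") width is enough narrower than the physical relation
 * width for the saved temp-file I/O to matter. */
static int
worth_projecting_early(int nbatch, int path_width, int rel_width)
{
    return nbatch > 1 && path_width < 0.8 * rel_width;  /* invented ratio */
}

int
main(void)
{
    printf("%d\n", worth_projecting_early(4, 16, 120));   /* 1: big saving */
    printf("%d\n", worth_projecting_early(4, 110, 120));  /* 0: little gain */
    printf("%d\n", worth_projecting_early(1, 16, 120));   /* 0: in memory */
    return 0;
}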
--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca
Has this been completed? TODO item?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
> Has this been completed? TODO item?

> > I'd be more inclined to deal with the issue by trying to establish a
> > "safety margin" in the estimate of whether the hash will go
> > multi-batch. IOW we should disuse_physical_tlist if the hash is
> > estimated to be close to but still within one batch.
I do not know how this issue was resolved. It is an issue that is very
important for multi-batch hash joins. The simplest resolution is to
disable physical_tlist on the outer relation for hash joins of more than
one batch. However, as discussed in the thread, more sophisticated
solutions are also viable.
--
Ramon Lawrence
"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes:
> Attached is a patch that will disable the physical-tlist optimization
> for hash join if the number of batches is greater than 1. The patch and
> performance results were created by Michael Henderson (graduate
> student).
I've applied the attached modified version of this patch.
regards, tom lane