Costing foreign joins in postgres_fdw

Started by Ashutosh Bapat, about 10 years ago, 5 messages
#1Ashutosh Bapat
ashutosh.bapat@enterprisedb.com

Hi All,
Costs for foreign queries are either obtained from the foreign server using
EXPLAIN (if use_remote_estimate is ON) or else they are cooked up locally
based on the statistics available. For joins as well, we have to do the
same. If use_remote_estimate [1] is ON, we can get the costs from the
foreign server. The rest of this mail discusses approaches for estimating
the costs when use_remote_estimate is OFF.

1. Unlike base relations, whose table data "has to be" fetched from the
foreign server, a join doesn't "have to be" executed on the foreign
server. So, even if use_remote_estimate is OFF for a base relation, we
still try to estimate something locally. For a join that's not compulsory,
so we can choose not to estimate anything and simply not push down the join
when use_remote_estimate is OFF. Whether we do that depends upon how well
we can estimate the join cost when use_remote_estimate is OFF.

2. Locally estimating the cost of a join that will be performed on the
foreign server is difficult because we do not know which join strategy the
foreign server is going to use, which in turn depends upon the availability
of indexes, work memory, statistics about the joining expressions, etc. One
way to do this is to use the cost of the cheapest local join path built
upon foreign outer and inner paths to estimate the cost of executing the
join at the foreign server. The startup and run-time costs for sending,
parsing and planning the query at the foreign server, as well as the cost
to fetch the tuples, need to be adjusted so that they don't get counted
twice. We may assume that the cost of the foreign join will be some factor
of the adjusted cost, as we have done for estimating the cost of sort
pushdown. The reason we choose the cheapest path with foreign inner and
outer paths is that it's likely to be closer to the real estimate than a
path which does not have foreign inner and outer paths. In the absence of
such a path, we should probably not push the join down, since no local path
has found pushing inner and outer down to be cheaper, and it's likely
(though certainly not a rule) that pushing the join in question down will
not be cheaper than the local paths.
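
To make the second approach concrete, here is a minimal sketch of the cost arithmetic it describes. This is not postgres_fdw code; the constants are hypothetical knobs, loosely modelled on postgres_fdw's fdw_startup_cost and fdw_tuple_cost options, and the adjustment factor is an assumption:

```python
# Illustrative sketch (not postgres_fdw internals) of approach 2:
# derive the pushed-down join's cost from the cheapest local join path
# built on foreign inner and outer paths. All constants are hypothetical.

FDW_STARTUP_COST = 100.0   # connect, parse and plan the remote query once
FDW_TUPLE_COST = 0.01      # per-tuple cost of shipping a result row back
REMOTE_JOIN_FACTOR = 0.75  # assumed ratio: remote join cost / local cost

def foreign_join_cost(local_join_total_cost, outer_fetch_cost,
                      inner_fetch_cost, est_join_rows):
    """Estimate the cost of executing the join remotely.

    local_join_total_cost: total cost of the cheapest local join path
    whose inner and outer inputs are foreign-scan paths.
    outer/inner_fetch_cost: the fetch overhead (startup plus tuple
    transfer) already baked into those foreign-scan paths; subtracted
    here so it is not counted twice once the join is pushed down.
    """
    adjusted = local_join_total_cost - outer_fetch_cost - inner_fetch_cost
    remote_exec = adjusted * REMOTE_JOIN_FACTOR
    transfer = FDW_STARTUP_COST + est_join_rows * FDW_TUPLE_COST
    return remote_exec + transfer
```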

The 1st option is easy but sounds too restrictive. The 2nd option is
liberal but complex.

Any other ideas as to how we can estimate cost of foreign join when
use_remote_estimate is OFF?

[1]: /messages/by-id/CAFjFpRepSC2e3mZ1uYSopJD6R19fOZ0dNNf9Z=gnyKSB6wGk5g@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#2Albe Laurenz
laurenz.albe@wien.gv.at
In reply to: Ashutosh Bapat (#1)
Re: Costing foreign joins in postgres_fdw

Ashutosh Bapat wrote:

Costs for foreign queries are either obtained from the foreign server using EXPLAIN (if
use_remote_estimate is ON) otherwise they are cooked up locally based on the statistics available. For
joins as well, we have to do the same. If use_remote_estimates [1] is ON, we can get the costs from
the foreign server. Rest of the mail discusses approaches for estimating the costs when
use_remote_estimates is OFF.

1. Unlike base relations where the table data "has to be" fetched from the foreign server, a join
doesn't "have to be" fetched from the foreign server. So, even if use_remote_estimate is OFF for a
base relation, we do try to estimate something locally. But for a join that's not compulsory, so we
can choose not to estimate anything and not push down the join if use_remote_estimate is OFF. Whether
we do that depends upon how well we can estimate the join cost when use_remote_estimate is OFF.

2. Locally estimating the cost of join that will be performed on the foreign server is difficult
because we do not know which join strategy the foreign server is going to use, which in turn depends
upon the availability of indexes, work memory, statistics about joining expressions etc. One way to do
this is to use the cost of cheapest local join path built upon foreign outer and inner paths, to
estimate the cost of executing the join at the foreign server. The startup and run time costs for
sending, parsing and planning query at the foreign server as well as the cost to fetch the tuples need
to be adjusted, so that it doesn't get counted twice. We may assume that the cost for the foreign join
will be some factor of the adjusted cost, like we have done for estimating cost of sort pushdown. The
reason we choose cheapest path with foreign inner and outer paths is because that's likely to be a
closer to the real estimate than the path which does not have foreign inner and outer paths. In the
absence of such path, we should probably not push the join down since no local path has found pushing
inner and outer to be cheaper and it's likely (certainly not a rule) that pushing the join in question
down is not going to be cheaper than the local paths.

1st option is easy but it sounds too restrictive. 2nd option liberal but is complex.

My gut feeling is that for a join where all join predicates can be pushed down, it
will usually be a win to push the join to the foreign server.

So in your first scenario, I'd opt for always pushing down the join
if possible if use_remote_estimate is OFF.

Your second scenario is essentially to estimate that a pushed down join will
always be executed as a nested loop join, which will in most cases produce
an unfairly negative estimate.

What about using local statistics to come up with an estimated row count for
the join and using that as the basis for an estimate? My idea here is that it
is always a win to push down a join unless the result set is so large that
transferring it becomes the bottleneck.
Maybe, to come up with something remotely realistic, a formula like

sum of locally estimated costs of sequential scan for the base table
plus count of estimated result rows (times a factor)
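
A toy sketch of that formula with made-up numbers; the per-row transfer factor below is a hypothetical knob, not a postgres_fdw constant:

```python
# Illustrative sketch of the proposed estimate: the sum of the locally
# estimated sequential-scan costs of the base tables, plus the estimated
# result row count times a transfer factor.

TRANSFER_COST_PER_ROW = 0.01  # hypothetical per-row shipping cost

def pushed_down_join_cost(base_table_seqscan_costs, est_result_rows):
    return (sum(base_table_seqscan_costs)
            + est_result_rows * TRANSFER_COST_PER_ROW)
```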

Yours,
Laurenz Albe

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Robert Haas
robertmhaas@gmail.com
In reply to: Albe Laurenz (#2)
Re: Costing foreign joins in postgres_fdw

On Fri, Dec 18, 2015 at 8:09 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:

My gut feeling is that for a join where all join predicates can be pushed down, it
will usually be a win to push the join to the foreign server.

So in your first scenario, I'd opt for always pushing down the join
if possible if use_remote_estimate is OFF.

Your second scenario is essentially to estimate that a pushed down join will
always be executed as a nested loop join, which will in most cases produce
an unfairly negative estimate.

+1 to all that. Whatever we do here for costing in detail, it should
be set up so that the pushed-down join wins unless there's some pretty
tangible reason to think, in a particular case, that it will lose.

What about using local statistics to come up with an estimated row count for
the join and use that as the basis for an estimate? My idea here is that it
is always be a win to push down a join unless the result set is so large that
transferring it becomes the bottleneck.

This also sounds about right.

Maybe, to come up with something remotely realistic, a formula like

sum of locally estimated costs of sequential scan for the base table
plus count of estimated result rows (times a factor)

Was this meant to say "the base tables", plural?

I think whatever we do here should try to extend the logic in
postgres_fdw's estimate_path_cost_size() to foreign tables in some
reasonably natural way, but I'm not sure exactly what that should look
like. Maybe do what that function currently does for single-table
scans, and then add all the values up, or something like that. I'm a
little worried, though, that the planner might then view a query that
will be executed remotely as a nested loop with inner index-scan as
not worth pushing down, because in that case the join actually will
not touch every row from both tables, as a hash or merge join would.
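
That worry can be made concrete with a toy comparison (all numbers are invented for illustration): an estimate built by summing full sequential-scan costs charges for reading every row of both tables, while a remote nested loop with an inner index scan only probes the inner table once per outer row.

```python
# Toy numbers illustrating the overestimate described above: the
# seqscan-sum estimate pays for every inner row, while a nested loop
# with an inner index scan probes the inner table once per outer row.

outer_seqscan_cost = 100.0    # read all outer rows
inner_seqscan_cost = 10000.0  # read all inner rows (large table)
outer_rows = 50
index_probe_cost = 4.0        # hypothetical cost of one index probe

seqscan_sum_estimate = outer_seqscan_cost + inner_seqscan_cost
nestloop_indexscan_cost = outer_seqscan_cost + outer_rows * index_probe_cost

# The pushed-down join looks far more expensive than it really is,
# so the planner could wrongly reject the pushdown.
```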

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#4Albe Laurenz
laurenz.albe@wien.gv.at
In reply to: Robert Haas (#3)
Re: Costing foreign joins in postgres_fdw

Robert Haas wrote:

Maybe, to come up with something remotely realistic, a formula like

sum of locally estimated costs of sequential scan for the base table
plus count of estimated result rows (times a factor)

Was this meant to say "the base tables", plural?

Yes.

I think whatever we do here should try to extend the logic in
postgres_fdw's estimate_path_cost_size() to foreign tables in some
reasonably natural way, but I'm not sure exactly what that should look
like. Maybe do what that function currently does for single-table
scans, and then add all the values up, or something like that. I'm a
little worried, though, that the planner might then view a query that
will be executed remotely as a nested loop with inner index-scan as
not worth pushing down, because in that case the join actually will
not touch every row from both tables, as a hash or merge join would.

That's exactly what I meant, minus a contribution for the estimated
result set size.

There are cases where a nested loop is faster than a hash join,
but I think it is rare that this is by orders of magnitude.

So I'd say it is a decent rough estimate, and that's the best we can
hope for here, if we cannot ask the remote server.

Yours,
Laurenz Albe


#5Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Albe Laurenz (#2)
Re: Costing foreign joins in postgres_fdw

On Fri, Dec 18, 2015 at 6:39 PM, Albe Laurenz <laurenz.albe@wien.gv.at>
wrote:

Ashutosh Bapat wrote:

Costs for foreign queries are either obtained from the foreign server
using EXPLAIN (if use_remote_estimate is ON) otherwise they are cooked up
locally based on the statistics available. For joins as well, we have to
do the same. If use_remote_estimates [1] is ON, we can get the costs from
the foreign server. Rest of the mail discusses approaches for estimating
the costs when use_remote_estimates is OFF.

1. Unlike base relations where the table data "has to be" fetched from
the foreign server, a join doesn't "have to be" fetched from the foreign
server. So, even if use_remote_estimate is OFF for a base relation, we do
try to estimate something locally. But for a join that's not compulsory,
so we can choose not to estimate anything and not push down the join if
use_remote_estimate is OFF. Whether we do that depends upon how well we
can estimate the join cost when use_remote_estimate is OFF.

2. Locally estimating the cost of join that will be performed on the
foreign server is difficult because we do not know which join strategy
the foreign server is going to use, which in turn depends upon the
availability of indexes, work memory, statistics about joining
expressions etc. One way to do this is to use the cost of cheapest local
join path built upon foreign outer and inner paths, to estimate the cost
of executing the join at the foreign server. The startup and run time
costs for sending, parsing and planning query at the foreign server as
well as the cost to fetch the tuples need to be adjusted, so that it
doesn't get counted twice. We may assume that the cost for the foreign
join will be some factor of the adjusted cost, like we have done for
estimating cost of sort pushdown. The reason we choose cheapest path with
foreign inner and outer paths is because that's likely to be a closer to
the real estimate than the path which does not have foreign inner and
outer paths. In the absence of such path, we should probably not push the
join down since no local path has found pushing inner and outer to be
cheaper and it's likely (certainly not a rule) that pushing the join in
question down is not going to be cheaper than the local paths.

1st option is easy but it sounds too restrictive. 2nd option liberal but
is complex.

My gut feeling is that for a join where all join predicates can be pushed
down, it
will usually be a win to push the join to the foreign server.

At least in the first cut, we are planning to push a join down only when
all the "join" predicates are pushable, otherwise, we do not push down the
join.

So in your first scenario, I'd opt for always pushing down the join
if possible if use_remote_estimate is OFF.

Well, that's what the mail is about: how do we decide whether or not to
push a join down, based on the costs, when remote estimation is not possible?

Your second scenario is essentially to estimate that a pushed down join
will
always be executed as a nested loop join, which will in most cases produce
an unfairly negative estimate.

No, that's not the intention behind the second scenario. The cheapest
local strategy needn't always be (in fact it's mostly not) a nested loop
join. It could be anything: a hash join, a merge join. The reason I
suggested the cheapest local path was to give the foreign join an advantage
over all the local paths created. If we cost the foreign join on the fairly
low side compared to all the local paths and then add transfer costs, that
might give the foreign join an added advantage. But relying on the cheapest
local strategy may cause the costs to sway a lot because of inaccuracies in
local statistics. That's where I think your next idea helps.

What about using local statistics to come up with an estimated row count
for the join and use that as the basis for an estimate? My idea here is
that it is always be a win to push down a join unless the result set is so
large that transferring it becomes the bottleneck.
Maybe, to come up with something remotely realistic, a formula like

sum of locally estimated costs of sequential scan for the base table
plus count of estimated result rows (times a factor)

Cost of foreign join = cost of scanning all the involved base relations +
cost of applying the quals + cost of transferring the result of the join.
The join is planned two relations (base or join) at a time, so we should be
able to compute this cost recursively, rather than grabbing the cost of
scanning the base relations every time. That will also work if part of the
join tree is built with use_remote_estimate ON and part without.
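
The recursive computation described above can be sketched as follows. This is an illustration only, not postgres_fdw code; the class names and the transfer constant are hypothetical:

```python
# Illustrative sketch: compute the pushed-down join cost recursively,
# two relations at a time. Each node carries a "remote cost" built from
# its two inputs, so base-relation scan costs are not re-derived at
# every join level; the transfer cost is charged once, at the top.

TRANSFER_COST_PER_ROW = 0.01  # hypothetical per-row shipping cost

class BaseRel:
    def __init__(self, seqscan_cost, rows):
        self.remote_cost = seqscan_cost  # locally estimated scan cost
        self.rows = rows

class JoinRel:
    def __init__(self, outer, inner, qual_cost, est_rows):
        # Reuse the inputs' already-computed remote costs, whether the
        # input is a base relation or a lower join.
        self.remote_cost = outer.remote_cost + inner.remote_cost + qual_cost
        self.rows = est_rows

def total_pushdown_cost(rel):
    # Only the final result of the whole pushed-down join is shipped back.
    return rel.remote_cost + rel.rows * TRANSFER_COST_PER_ROW
```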


--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company