Question about optimising (Postgres_)FDW
Hi
I am playing around with postgres_fdw and found that the following code ...
----------------------------------
CREATE EXTENSION postgres_fdw;

CREATE SERVER loop foreign data wrapper postgres_fdw
    OPTIONS (port '5432', dbname 'testdb');

CREATE USER MAPPING FOR PUBLIC SERVER loop;

create table onemillion (
    id serial primary key,
    inserted timestamp default clock_timestamp(),
    data text
);

insert into onemillion(data)
    select random() from generate_series(1,1000000);

CREATE FOREIGN TABLE onemillion_pgfdw (
    id int,
    inserted timestamp,
    data text
) SERVER loop
    OPTIONS (table_name 'onemillion',
             use_remote_estimate 'true');
testdb=# explain analyse
select * from onemillion_pgfdw where id in (select id from onemillion
where data > '0.9' limit 100);
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=122.49..10871.06 rows=500000 width=44) (actual time=4.269..93.444 rows=100 loops=1)
   ->  HashAggregate  (cost=22.06..23.06 rows=100 width=4) (actual time=1.110..1.263 rows=100 loops=1)
         ->  Limit  (cost=0.00..20.81 rows=100 width=4) (actual time=0.038..1.026 rows=100 loops=1)
               ->  Seq Scan on onemillion  (cost=0.00..20834.00 rows=100115 width=4) (actual time=0.036..0.984 rows=100 loops=1)
                     Filter: (data > '0.9'::text)
                     Rows Removed by Filter: 805
   ->  Foreign Scan on onemillion_pgfdw  (cost=100.43..108.47 rows=1 width=29) (actual time=0.772..0.773 rows=1 loops=100)
 Total runtime: 93.820 ms
(8 rows)

Time: 97.283 ms
------------------------------
... actually performs 100 distinct "SELECT * FROM onemillion WHERE id =
$1" calls on the "remote" side.
Is there a way to force it to prefer a plan where the results of (select
id from onemillion where data > '0.9' limit 100) are passed to the FDW
as a single IN (i.e. = ANY(...)) query and are retrieved all at once?

If not, how hard would it be to add this feature?
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
Hannu Krosing <hannu@2ndQuadrant.com> writes:
> Is there a way to force it to prefer a plan where the results of (select
> id from onemillion where data > '0.9' limit 100) are passed to the FDW
> as a single IN (i.e. = ANY(...)) query and are retrieved all at once?
You could write the query like this:
select * from onemillion_pgfdw where id = any (array(select id from
onemillion where data > '0.9' limit 100));
Or at least you should be able to, except when I try it I get
explain analyze
select * from onemillion_pgfdw where id = any (array(select id from
onemillion where data > '0.9' limit 100));
ERROR: operator does not exist: integer = integer[]
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
CONTEXT: Remote SQL command: EXPLAIN SELECT id, inserted, data FROM public.onemillion WHERE ((id = ANY ((SELECT null::integer[]))))
so there's something the remote-estimate code is getting wrong here.
(It seems to work without remote_estimate, though.)
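(For what it's worth, a minimal way to reproduce the working case without
remote estimates, reusing the objects created upthread; a sketch only:)

    ALTER FOREIGN TABLE onemillion_pgfdw OPTIONS (SET use_remote_estimate 'false');

    EXPLAIN ANALYZE
    SELECT * FROM onemillion_pgfdw
     WHERE id = ANY (ARRAY(SELECT id FROM onemillion
                           WHERE data > '0.9' LIMIT 100));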
regards, tom lane
On 04/16/2014 01:25 AM, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
>> Is there a way to force it to prefer a plan where the results of (select
>> id from onemillion where data > '0.9' limit 100) are passed to the FDW
>> as a single IN (i.e. = ANY(...)) query and are retrieved all at once?
> You could write the query like this:
> select * from onemillion_pgfdw where id = any (array(select id from
> onemillion where data > '0.9' limit 100));
My actual use case was a join between a local and a remote table, and
without rewriting the query (they come from an ORM) I was hoping to be
able to nudge PostgreSQL towards a better plan via some tuning of
table/FDW options or GUCs.
For example, would PostgreSQL use a WHERE id IN (...) query on the
remote side for a query like

select r.data, l.data
  from onemillion_pgfdw r
  join onemillion l
    on r.id = l.id and l.data > '0.999';

if it recognizes that the local side returns only 1000 rows?
Or would it still run 1000 individual WHERE id = $1 queries?
Is fetching the foreign data via IN and then hashing it for the join
one of the plans the planner considers at all?
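(Roughly, what I am hoping the planner could send to the remote side,
i.e. the same shape of Remote SQL that a manual ARRAY() rewrite produces:)

    -- hoped-for remote query; $1 would hold the ids collected locally
    SELECT id, data FROM public.onemillion WHERE ((id = ANY ($1::integer[])))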
Best
Hannu
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
On 04/16/2014 06:12 AM, Hannu Krosing wrote:
> [...]
> Is fetching the foreign data via IN and then hashing it for the join
> one of the plans the planner considers at all?
It seems we could use an extra tuning parameter for choosing the
ID IN (...) + hash plan over individual SELECT .. WHERE ID = $1 calls,
something between `fdw_startup_cost` and `fdw_tuple_cost`, to signify
that one IN query returning 1000 rows runs faster than 1000 single-row
queries.

As I understand it, currently both would be estimated as

    fdw_startup_cost + 1000 * fdw_tuple_cost

The new parameter could be fdw_call_cost or fdw_query_cost, and would
estimate how much each individual call to the FDW costs, thus favouring
plans which return more data in one call.
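(A toy calculation with made-up numbers, just to illustrate the effect;
fdw_call_cost is hypothetical, not an existing option:)

    -- assume fdw_startup_cost = 100, fdw_tuple_cost = 0.01, fdw_call_cost = 25
    -- 1000 queries of WHERE id = $1, one row each:
    --     100 + 1000 * 25 + 1000 * 0.01 = 25110
    -- one query of WHERE id = ANY(...) returning 1000 rows:
    --     100 +    1 * 25 + 1000 * 0.01 =   135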
Cheers
Hannu
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
(2014/04/16 6:55), Hannu Krosing wrote:
> [...]
> ... actually performs 100 distinct "SELECT * FROM onemillion WHERE id =
> $1" calls on the "remote" side.
Maybe I'm missing something, but I think that you can do what I think
you'd like to do by the following procedure:
postgres=# ALTER SERVER loop OPTIONS (ADD fdw_startup_cost '1000');
ALTER SERVER
postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
(SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=1023.10..41983.21 rows=100 width=30)
   Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
   Hash Cond: (onemillion_pgsql.id = onemillion.id)
   ->  Foreign Scan on public.onemillion_pgsql  (cost=1000.00..39334.00 rows=1000000 width=29)
         Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
         Remote SQL: SELECT id, inserted, data FROM public.onemillion
   ->  Hash  (cost=21.85..21.85 rows=100 width=4)
         Output: onemillion.id
         ->  Limit  (cost=0.00..20.85 rows=100 width=4)
               Output: onemillion.id
               ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
                     Output: onemillion.id
                     Filter: (onemillion.data > '0.9'::text)
 Planning time: 0.690 ms
(14 rows)
or, as Tom mentioned, by disabling the use_remote_estimate option:
postgres=# ALTER FOREIGN TABLE onemillion_pgsql OPTIONS (SET
use_remote_estimate 'false');
ALTER FOREIGN TABLE
postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
(SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=123.10..41083.21 rows=100 width=30)
   Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
   Hash Cond: (onemillion_pgsql.id = onemillion.id)
   ->  Foreign Scan on public.onemillion_pgsql  (cost=100.00..38434.00 rows=1000000 width=30)
         Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
         Remote SQL: SELECT id, inserted, data FROM public.onemillion
   ->  Hash  (cost=21.85..21.85 rows=100 width=4)
         Output: onemillion.id
         ->  Limit  (cost=0.00..20.85 rows=100 width=4)
               Output: onemillion.id
               ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
                     Output: onemillion.id
                     Filter: (onemillion.data > '0.9'::text)
 Planning time: 0.215 ms
(14 rows)
Thanks,
Best regards,
Etsuro Fujita
On 04/16/2014 01:35 PM, Etsuro Fujita wrote:
> (2014/04/16 6:55), Hannu Krosing wrote:
> [...]
> Maybe I'm missing something, but I think that you can do what I think
> you'd like to do by the following procedure:
No, what I'd like PostgreSQL to do is to

1. select the id set from the local table,
2. select the rows from the remote table with WHERE ID IN (<set selected
   in step 1>),
3. then join the original set to the selected set, with any suitable
   join strategy.
The things I do not want are

A. selecting all rows from the remote table
   (this is what your examples below do), or
B. selecting rows from the remote table by single selects using "ID = $1"
   (this is something that I managed to get by some tweaking of costs),

as A will always be slow if there are millions of rows in the remote
table, and B is slow(ish) when the id set is over a few hundred ids.
I hope this is a better explanation than I provided before.
Cheers
Hannu
P.S. I am not sure if this is a limitation of postgres_fdw or of
postgres itself.

P.P.S. I tested a little with Multicorn, and postgresql did not request
row counts for any IN plans, so it may be that the planner does not
consider this kind of plan at all. (Testing was on PgSQL 9.3.4.)
Hannu
On 04/16/2014 03:16 PM, Hannu Krosing wrote:
> [...]
Also, a sample run of the two plans to illustrate my point.

How it is run now:
testdb=# explain analyse verbose
select r.data, l.data
  from onemillion_pgfdw r
  join onemillion l
    on r.id = l.id and l.id between 100000 and 100100;
                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=111.61..198.40 rows=1 width=16) (actual time=7534.360..8731.541 rows=101 loops=1)
   Output: r.data, l.data
   Hash Cond: (r.id = l.id)
   ->  Foreign Scan on public.onemillion_pgfdw r  (cost=100.00..178.25 rows=2275 width=12) (actual time=1.628..8364.688 rows=1000000 loops=1)
         Output: r.id, r.inserted, r.data
         Remote SQL: SELECT id, data FROM public.onemillion
   ->  Hash  (cost=10.39..10.39 rows=98 width=12) (actual time=0.179..0.179 rows=101 loops=1)
         Output: l.data, l.id
         Buckets: 1024  Batches: 1  Memory Usage: 5kB
         ->  Index Scan using onemillion_pkey on public.onemillion l  (cost=0.42..10.39 rows=98 width=12) (actual time=0.049..0.124 rows=101 loops=1)
               Output: l.data, l.id
               Index Cond: ((l.id >= 100000) AND (l.id <= 100100))
 Total runtime: 8732.213 ms
(13 rows)

Time: 8733.799 ms
And how the above query should be planned/executed:
testdb=# explain analyse verbose
select r.data, l.data
  from (select * from onemillion_pgfdw where id = any (array(select id
        from onemillion where id between 100000 and 100100))) r
  join onemillion l
    on r.id = l.id;
                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=110.81..1104.30 rows=111 width=16) (actual time=2.756..3.738 rows=101 loops=1)
   Output: onemillion_pgfdw.data, l.data
   InitPlan 1 (returns $0)
     ->  Index Only Scan using onemillion_pkey on public.onemillion  (cost=0.42..10.39 rows=98 width=4) (actual time=0.055..0.118 rows=101 loops=1)
           Output: onemillion.id
           Index Cond: ((onemillion.id >= 100000) AND (onemillion.id <= 100100))
           Heap Fetches: 101
   ->  Foreign Scan on public.onemillion_pgfdw  (cost=100.00..163.41 rows=111 width=12) (actual time=2.729..3.012 rows=101 loops=1)
         Output: onemillion_pgfdw.id, onemillion_pgfdw.inserted, onemillion_pgfdw.data
         Remote SQL: SELECT id, data FROM public.onemillion WHERE ((id = ANY ($1::integer[])))
   ->  Index Scan using onemillion_pkey on public.onemillion l  (cost=0.42..8.37 rows=1 width=12) (actual time=0.005..0.006 rows=1 loops=101)
         Output: l.id, l.inserted, l.data
         Index Cond: (l.id = onemillion_pgfdw.id)
 Total runtime: 4.469 ms
(14 rows)

Time: 6.437 ms
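That is roughly 8.7 seconds versus 4.5 milliseconds of runtime for the
same 101 rows.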
AFAIK, PostgreSQL's join nodes (except for hash join) consider one row
at a time from the outer table and match inner-table rows one at a time.
What needs to be done in the case you are suggesting is to consider all
the rows of the outer table, fetch their respective join columns, and
pass that information down to the inner side. The inner side would then
return the bunch of rows qualifying the join condition, and this set
would be joined with the outer rows again.

For an equality operator this might be possible in a hash join, but for
other operators a hash join won't work. For those we would need to
materialize the outer result, which has its own cost that needs to be
factored in. A lot of changes, but they may be worth it for foreign
scans with high connection costs.
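(Schematically, the execution being discussed, as a sketch of hypothetical
planner behaviour rather than anything postgres_fdw does today:)

    -- phase 1: run the outer side locally and collect the join keys
    --     SELECT id FROM onemillion WHERE data > '0.999';
    -- phase 2: one batched remote fetch instead of one query per key
    --     SELECT id, data FROM public.onemillion WHERE id = ANY ($1::integer[]);
    -- phase 3: hash the fetched rows and finish the join locally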
On Wed, Apr 16, 2014 at 9:40 PM, Hannu Krosing <hannu@krosing.net> wrote:
> [...]
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
(2014/04/16 22:16), Hannu Krosing wrote:
> On 04/16/2014 01:35 PM, Etsuro Fujita wrote:
>> Maybe I'm missing something, but I think that you can do what I think
>> you'd like to do by the following procedure:
> No, what I'd like PostgreSQL to do is to
> 1. select the id set from the local table,
> 2. select the rows from the remote table with WHERE ID IN (<set selected
>    in step 1>),
> 3. then join the original set to the selected set, with any suitable
>    join strategy.
> [...]
Ah, I understand what you'd like to do. Thank you for the explanation.
> P.S. I am not sure if this is a limitation of postgres_fdw or of
> postgres itself.
If I understand correctly, neither the current postgres_fdw planning
function nor the current postgres planner itself supports such a plan.
For that I think we would probably need to implement a distributed query
processing technique such as semijoin or Bloom join in those modules.
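(For reference, a Bloom join here would work roughly as follows; this is
a sketch only, and bloom_contains is a hypothetical function, nothing
like it exists in postgres_fdw today:)

    -- 1. build a Bloom filter over the join keys produced by the local side
    -- 2. ship the filter so the remote server returns only rows whose id
    --    probably matches, e.g. a remote query of the shape
    --        SELECT id, data FROM public.onemillion WHERE bloom_contains($1, id);
    -- 3. re-check the join condition locally to discard false positives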
Thanks,
P.S.
> or, as Tom mentioned, by disabling the use_remote_estimate option:
I misunderstood the meaning of what Tom pointed out. Sorry for that.
Best regards,
Etsuro Fujita