Foreign join pushdown vs EvalPlanQual

Started by Etsuro Fujita over 10 years ago, 173 messages

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

Hi,

While reviewing the foreign join pushdown core patch, I noticed that the
patch doesn't perform an EvalPlanQual recheck properly. The example
that crashes the server will be shown below (it uses the postgres_fdw
patch [1]). I think the reason for that is because the ForeignScan node
performing the foreign join remotely has scanrelid = 0 while
ExecScanFetch assumes that its scan node has scanrelid > 0.

I think this is a bug. I've not figured out how to fix this yet, but I
thought we would also need another plan that evaluates the join locally
for the test tuples for EvalPlanQual. Maybe I'm missing something, though.
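
A toy model of the suspected failure, as a minimal sketch rather than actual PostgreSQL source. It assumes only what is stated above, namely that the recheck path looks up per-relation EvalPlanQual state by scanrelid. ToyEState and toy_epq_tuple are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the executor state used during a recheck. */
typedef struct ToyEState
{
    int          nrels;      /* number of base relations (range-table entries) */
    const char **epq_tuple;  /* one test tuple per relation, index = rtindex - 1 */
} ToyEState;

/*
 * Return the test tuple for a scan node, or NULL when the node does not
 * scan a single base relation (scanrelid == 0, e.g. a pushed-down join).
 * Without the scanrelid == 0 check, scanrelid - 1 would index element -1.
 */
static const char *
toy_epq_tuple(const ToyEState *estate, int scanrelid)
{
    assert(scanrelid >= 0 && scanrelid <= estate->nrels);
    if (scanrelid == 0)
        return NULL;         /* no single slot represents a join */
    return estate->epq_tuple[scanrelid - 1];
}
```

In this model, an unguarded lookup with scanrelid = 0 would read outside the array, which is the kind of out-of-range access that could explain the crash shown below.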

Create an environment:

postgres=# create table tab (a int, b int);
CREATE TABLE
postgres=# create foreign table foo (a int) server myserver options
(table_name 'foo');
CREATE FOREIGN TABLE
postgres=# create foreign table bar (a int) server myserver options
(table_name 'bar');
CREATE FOREIGN TABLE
postgres=# insert into tab values (1, 1);
INSERT 0 1
postgres=# insert into foo values (1);
INSERT 0 1
postgres=# insert into bar values (1);
INSERT 0 1
postgres=# analyze tab;
ANALYZE
postgres=# analyze foo;
ANALYZE
postgres=# analyze bar;
ANALYZE

Run the example:

[Terminal 1]
postgres=# begin;
BEGIN
postgres=# update tab set b = b + 1 where a = 1;
UPDATE 1

[Terminal 2]
postgres=# explain verbose select tab.* from tab, foo, bar where tab.a =
foo.a and foo.a = bar.a for update;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 LockRows  (cost=100.00..101.18 rows=4 width=70)
   Output: tab.a, tab.b, tab.ctid, foo.*, bar.*
   ->  Nested Loop  (cost=100.00..101.14 rows=4 width=70)
         Output: tab.a, tab.b, tab.ctid, foo.*, bar.*
         Join Filter: (foo.a = tab.a)
         ->  Seq Scan on public.tab  (cost=0.00..1.01 rows=1 width=14)
               Output: tab.a, tab.b, tab.ctid
         ->  Foreign Scan  (cost=100.00..100.08 rows=4 width=64)
               Output: foo.*, foo.a, bar.*, bar.a
               Relations: (public.foo) INNER JOIN (public.bar)
               Remote SQL: SELECT l.a1, l.a2, r.a1, r.a2 FROM (SELECT ROW(l.a9), l.a9 FROM (SELECT a a9 FROM public.foo FOR UPDATE) l) l (a1, a2) INNER JOIN (SELECT ROW(r.a9), r.a9 FROM (SELECT a a9 FROM public.bar FOR UPDATE) r) r (a1, a2) ON ((l.a2 = r.a2))
(11 rows)

postgres=# select tab.* from tab, foo, bar where tab.a = foo.a and foo.a
= bar.a for update;

[Terminal 1]
postgres=# commit;
COMMIT

[Terminal 2]
(After the commit in Terminal 1, Terminal 2 will show the following.)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Best regards,
Etsuro Fujita

[1]: /messages/by-id/CAEZqfEe9KGy=1_waGh2rgZPg0o4pqgD+iauYaj8wTze+CYJUHg@mail.gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Kohei KaiGai

kaigai@kaigai.gr.jp

over 10 years ago

In reply to: Etsuro Fujita (#1)

Re: Foreign join pushdown vs EvalPlanQual

Does it make sense to put the result tuple of the remote join in every
estate->es_epqTupleSet[] slot represented by this ForeignScan if
scanrelid==0?

That would allow rechecking the qualifiers for each LockRows that intends to lock
a base foreign table underlying the remote join.
ForeignScan->fdw_relids tells us which rtindexes are represented
by this ForeignScan, so the infrastructure side may be able to handle it.

Thanks,

2015-06-24 11:40 GMT+09:00 Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>:

Hi,

While reviewing the foreign join pushdown core patch, I noticed that the
patch doesn't perform an EvalPlanQual recheck properly. The example
that crashes the server will be shown below (it uses the postgres_fdw
patch [1]). I think the reason for that is because the ForeignScan node
performing the foreign join remotely has scanrelid = 0 while
ExecScanFetch assumes that its scan node has scanrelid > 0.

I think this is a bug. I've not figured out how to fix this yet, but I
thought we would also need another plan that evaluates the join locally
for the test tuples for EvalPlanQual. Maybe I'm missing something, though.

--
KaiGai Kohei <kaigai@kaigai.gr.jp>


Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Kohei KaiGai (#2)

Re: Foreign join pushdown vs EvalPlanQual

Fujita-san,

Does it make sense to put the result tuple of the remote join in every
estate->es_epqTupleSet[] slot represented by this ForeignScan if
scanrelid==0?

Sorry, I misunderstood the behavior of es_epqTupleSet[].

I'd like to suggest a solution that reconstructs the remote tuple according
to the fdw_scan_tlist in ExecScanFetch, if the given scanrelid == 0.
It makes it possible to run the local qualifiers associated with the ForeignScan node,
and it will also work for the case where the tuple in es_epqTupleSet[] comes from
the local heap.

For details:
es_epqTuple[] is set by EvalPlanQualSetTuple(). It stores a tuple
that exactly reflects a particular base relation (one that has a positive rtindex).
Even for a foreign table, ExecLockRows() stores a tuple dynamically
constructed from a whole-row reference in EvalPlanQualFetchRowMarks().
So, regardless of whether it is a copy or a reference to the heap, we can expect
es_epqTuple[] to hold one tuple for each base relation.

On the other hand, a ForeignScan that replaced a local join with a remote
join has a valid fdw_scan_tlist. It contains the expression nodes
that construct the individual attributes of the pseudo-scan target list.

So, all we need to do is: (1) if scanrelid == 0 in ExecScanFetch(),
(2) the node should be a ForeignScan or CustomScan with a *_scan_tlist;
(3) then, we reconstruct a tuple of the pseudo scan based on the
*_scan_tlist, instead of simply referencing es_epqTupleSet[],
(4) and evaluate the local qualifiers of the node.
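
The four steps can be sketched with a toy model. This is only an illustration under assumptions, not executor code: ToyTuple, ToyTlistEntry, toy_build_scan_tuple and toy_local_qual are hypothetical names, tuples are rows of ints, and each target-list entry is reduced to a plain "column attno of relation rtindex" reference.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ATTRS 8

/* Hypothetical base-relation test tuple: a row of int columns. */
typedef struct ToyTuple
{
    int natts;
    int att[TOY_MAX_ATTRS];
} ToyTuple;

/* One toy scan target-list entry: "column attno of relation rtindex". */
typedef struct ToyTlistEntry
{
    int rtindex;   /* 1-based range-table index */
    int attno;     /* 1-based attribute number */
} ToyTlistEntry;

/*
 * Steps (1)-(3): for a node with scanrelid == 0, build the pseudo-scan
 * tuple by evaluating each target-list entry against the per-relation
 * test tuples (epq_tuples[rtindex - 1]).
 */
static ToyTuple
toy_build_scan_tuple(const ToyTuple *epq_tuples, int nrels,
                     const ToyTlistEntry *tlist, int ntlist)
{
    ToyTuple result = { .natts = ntlist };

    assert(ntlist <= TOY_MAX_ATTRS);
    for (int i = 0; i < ntlist; i++)
    {
        const ToyTlistEntry *e = &tlist[i];

        assert(e->rtindex >= 1 && e->rtindex <= nrels);
        assert(e->attno >= 1 && e->attno <= epq_tuples[e->rtindex - 1].natts);
        result.att[i] = epq_tuples[e->rtindex - 1].att[e->attno - 1];
    }
    return result;
}

/* Step (4): a sample local qualifier, here "column 1 = column 2". */
static bool
toy_local_qual(const ToyTuple *scan_tuple)
{
    assert(scan_tuple->natts >= 2);
    return scan_tuple->att[0] == scan_tuple->att[1];
}
```

For the tab/foo/bar example, a target list of {foo.a, bar.a} would be built from the test tuples of rtindex 2 and 3, and the qualifier would then be evaluated on the reconstructed row.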

What do you think?

BTW, if you tried a newer version of the postgres_fdw foreign join patch,
please send it to me so that I can reproduce the problem.

Also, as an aside, postgres_fdw does not implement RefetchForeignRow()
at this moment. Is that a problem?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kohei KaiGai
Sent: Wednesday, June 24, 2015 10:02 PM
To: Etsuro Fujita
Cc: PostgreSQL-development
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual


Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#3)

Re: Foreign join pushdown vs EvalPlanQual

Hi KaiGai-san,

I'd like to work on this issue with you!

On 2015/06/25 10:48, Kouhei Kaigai wrote:

I'd like to suggest a solution that reconstructs the remote tuple according
to the fdw_scan_tlist in ExecScanFetch, if the given scanrelid == 0.
It makes it possible to run the local qualifiers associated with the ForeignScan node,
and it will also work for the case where the tuple in es_epqTupleSet[] comes from
the local heap.

Maybe I'm missing something, but I don't think your proposal works
properly because we don't have any component ForeignScan state node or
subsidiary join state node once we've replaced the entire join with the
ForeignScan performing the join remotely, IIUC. So, my idea was to
have another subplan for EvalPlanQual as well as the ForeignScan, to do
the entire join locally for the component test tuples if we are inside
an EvalPlanQual recheck.
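
A minimal sketch of what "doing the join locally for the component test tuples" could look like for the foo/bar example. This is a toy model, not a proposed implementation: ToyRow, ToyJoinedRow and toy_epq_local_join are hypothetical names, and the only join clause modeled is foo.a = bar.a.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-column test tuples for foo and bar. */
typedef struct ToyRow
{
    int a;
} ToyRow;

/* Joined row, shaped like what the pushed-down join returns. */
typedef struct ToyJoinedRow
{
    int foo_a;
    int bar_a;
} ToyJoinedRow;

/*
 * Toy "alternative local plan" for foo JOIN bar ON foo.a = bar.a.
 * During a recheck there is exactly one test tuple per relation, so the
 * join reduces to evaluating the join clause once.
 * Returns true and fills *out if the test tuples still join.
 */
static bool
toy_epq_local_join(const ToyRow *foo, const ToyRow *bar, ToyJoinedRow *out)
{
    assert(foo != NULL && bar != NULL && out != NULL);
    if (foo->a != bar->a)
        return false;    /* join clause fails: no row passes the recheck */
    out->foo_a = foo->a;
    out->bar_a = bar->a;
    return true;
}
```

The point of the sketch is that the recheck needs the join clause itself, which is no longer available locally once the whole join has been replaced by a single ForeignScan.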

BTW, if you tried a newer version of the postgres_fdw foreign join patch,
please send it to me so that I can reproduce the problem.

Also, as an aside, postgres_fdw does not implement RefetchForeignRow()
at this moment. Is that a problem?

I don't think so, though I think it would be better to test that the
foreign join pushdown API patch also allows late row locking using the
postgres_fdw.

Best regards,
Etsuro Fujita


Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#4)

Re: Foreign join pushdown vs EvalPlanQual

Fujita-san,

BTW, if you tried a newer version of the postgres_fdw foreign join patch,
please send it to me so that I can reproduce the problem.

OK

Did you forget to attach the patch, or is v17 in use?

I'd like to suggest a solution that reconstructs the remote tuple according
to the fdw_scan_tlist in ExecScanFetch, if the given scanrelid == 0.
It makes it possible to run the local qualifiers associated with the ForeignScan node,
and it will also work for the case where the tuple in es_epqTupleSet[] comes from
the local heap.

Maybe I'm missing something, but I don't think your proposal works
properly because we don't have any component ForeignScan state node or
subsidiary join state node once we've replaced the entire join with the
ForeignScan performing the join remotely, IIUC. So, my idea was to
have another subplan for EvalPlanQual as well as the ForeignScan, to do
the entire join locally for the component test tuples if we are inside
an EvalPlanQual recheck.

Hmm... Probably, there are two approaches to tackle the problem.

The first approach treats the base foreign table as
the primary relation for locking. Thus, we have to provide a way to
fetch a remote tuple identified by the supplied ctid.
The advantage of this approach is that the way to fetch tuples from
a base relation is quite similar to the existing one; however,
its disadvantage is the other side of the same coin, because the
ForeignScan node with scanrelid==0 (which represents the remote join
query) may have local qualifiers that must run on the tuple
built according to fdw_scan_tlist.

The other approach treats a group of base foreign
tables as a unit. That means, if any of the base foreign tables is
the target of locking, it prompts the FDW driver to fetch the latest
"joined" tuple identified by "ctid", even if this join contains
multiple base relations to be locked.
The advantage of this approach is that we can use the qualifiers of
the ForeignScan node with scanrelid==0 and need not pay attention
to remote qualifiers and/or join conditions individually.
Its disadvantage is that we may have to extend the EState structure to keep the
"joined" tuples, in addition to es_epqTupleSet[].

I'm inclined to think the latter approach works well, because
it does not need to reproduce an alternative execution path on
the local side again, even if a ForeignScan node represents a much
more complicated remote query.
If we fetched tuples of the individual base relations, we would need
to reconstruct a join path identical to the one executed on the
remote side, wouldn't we?

IIUC, the purpose of EvalPlanQual() is to ensure the tuples to
be locked are still visible, so it is not essential
to fetch base tuples individually.

As an aside, please tell me if someone knows: does the EvalPlanQual
logic work correctly even if the tuple to be locked is located in
the right tree of a HashJoin?
In this case, it seems to me ExecHashJoin does not rebuild the hash
table even if ExecProcNode() is invoked with es_epqTupleSet[],
so the old tuple is already visible and checked, isn't it?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita
Sent: Thursday, June 25, 2015 3:12 PM
To: Kaigai Kouhei(海外浩平)
Cc: PostgreSQL-development
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual


Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#5)

Re: Foreign join pushdown vs EvalPlanQual

Hi KaiGai-san,

On 2015/06/27 21:09, Kouhei Kaigai wrote:

BTW, if you tried a newer version of the postgres_fdw foreign join patch,
please send it to me so that I can reproduce the problem.

OK

Did you forget to attach the patch, or is v17 in use?

Sorry, I made a mistake. The problem was produced using v16 [1].

I'd like to suggest a solution that reconstructs the remote tuple according
to the fdw_scan_tlist in ExecScanFetch, if the given scanrelid == 0.
It makes it possible to run the local qualifiers associated with the ForeignScan node,
and it will also work for the case where the tuple in es_epqTupleSet[] comes from
the local heap.

Maybe I'm missing something, but I don't think your proposal works
properly because we don't have any component ForeignScan state node or
subsidiary join state node once we've replaced the entire join with the
ForeignScan performing the join remotely, IIUC. So, my idea was to
have another subplan for EvalPlanQual as well as the ForeignScan, to do
the entire join locally for the component test tuples if we are inside
an EvalPlanQual recheck.

Hmm... Probably, there are two approaches to tackle the problem.

The first approach treats the base foreign table as
the primary relation for locking. Thus, we have to provide a way to
fetch a remote tuple identified by the supplied ctid.
The advantage of this approach is that the way to fetch tuples from
a base relation is quite similar to the existing one; however,
its disadvantage is the other side of the same coin, because the
ForeignScan node with scanrelid==0 (which represents the remote join
query) may have local qualifiers that must run on the tuple
built according to fdw_scan_tlist.

IIUC, I think this approach would also need to evaluate join conditions
and remote qualifiers, in addition to local qualifiers, on the local side for
component tuples that were re-fetched from the remote side (and remaining
component tuples that were copied from whole-row vars, if any), in cases
where the re-fetched tuples were updated versions of those tuples rather
than the same versions previously obtained.

The other approach treats a group of base foreign
tables as a unit. That means, if any of the base foreign tables is
the target of locking, it prompts the FDW driver to fetch the latest
"joined" tuple identified by "ctid", even if this join contains
multiple base relations to be locked.
The advantage of this approach is that we can use the qualifiers of
the ForeignScan node with scanrelid==0 and need not pay attention
to remote qualifiers and/or join conditions individually.
Its disadvantage is that we may have to extend the EState structure to keep the
"joined" tuples, in addition to es_epqTupleSet[].

That is an idea. However, ISTM there is another disadvantage: it is
not efficient, because it would need to perform another remote join
query having such additional conditions during an EvalPlanQual check, as
you proposed.

I'm inclined to think the latter approach works well, because
it does not need to reproduce an alternative execution path on
the local side again, even if a ForeignScan node represents a much
more complicated remote query.
If we fetched tuples of the individual base relations, we would need
to reconstruct a join path identical to the one executed on the
remote side, wouldn't we?

Yeah, that was my idea for fixing this issue.

IIUC, the purpose of EvalPlanQual() is to ensure the tuples to
be locked are still visible, so it is not essential
to fetch base tuples individually.

I think so too, but taking the similarity and/or efficiency of
processing into consideration, I would vote for the idea of having an
alternative execution path on the local side. That would also allow FDW
authors to write the foreign join pushdown functionality in their FDWs
with less effort.

Best regards,
Etsuro Fujita

[1]: /messages/by-id/CAEZqfEe9KGy=1_waGh2rgZPg0o4pqgD+iauYaj8wTze+CYJUHg@mail.gmail.com


Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#6)

Re: Foreign join pushdown vs EvalPlanQual

Hi Fujita-san,

Sorry for my late reply.

Even though I'd like to hear a committer's opinion, I could not come up
with an idea better than what you proposed: a foreign/custom scan
has an alternative plan if scanrelid==0.

Let me introduce a few cases we should pay attention to.

Foreign/CustomScan nodes may stack; that means a Foreign/CustomScan node
may have a child node that includes another Foreign/CustomScan node with
scanrelid==0.
(At this moment, ForeignScan cannot have a child node; however, more
aggressive push-down [1] will need the same feature to fetch tuples from
a local relation and construct a VALUES() clause.)
In this case, the highest Foreign/CustomScan node (the one nearest
to LockRows or ModifyTable) runs the alternative sub-plan that includes
the scan/join plans covered by fdw_relids or custom_relids.

For example:
  LockRows
   -> HashJoin
      -> CustomScan (AliceJoin)
         -> SeqScan on t1
         -> CustomScan (CarolJoin)
            -> SeqScan on t2
            -> SeqScan on t3
      -> Hash
         -> CustomScan (BobJoin)
            -> SeqScan on t4
            -> ForeignScan (remote join involves ft5, ft6)

In this case, AliceJoin will have an alternative sub-plan to join t1, t2
and t3, which shall be used in EvalPlanQual(). Also, BobJoin will
have an alternative sub-plan to join t4, ft5 and ft6. CarolJoin and the
ForeignScan will also have alternative sub-plans; however, these are
not used in this case.
Probably, it works fine.
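
The "highest node wins" rule can be sketched as a top-down search that stops at the first join-replacing node covering the relation. This is a toy model under assumptions: ToyPlan, its fields and toy_find_epq_owner are hypothetical, and relids is a plain bitmask where bit (rtindex - 1) stands for one relation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CHILDREN 2

/* Hypothetical plan node for illustration only. */
typedef struct ToyPlan
{
    const char     *name;
    bool            replaces_join;  /* Foreign/CustomScan with scanrelid == 0 */
    uint32_t        relids;         /* bit (rtindex - 1) set per covered relation */
    struct ToyPlan *child[TOY_MAX_CHILDREN];
} ToyPlan;

/*
 * Walk the tree top-down and return the first (highest) join-replacing
 * node that covers rtindex. The search does not descend below a match,
 * so a nested node such as CarolJoin is never chosen when its ancestor
 * already covers the relation. Returns NULL if no such node exists.
 */
static const ToyPlan *
toy_find_epq_owner(const ToyPlan *node, int rtindex)
{
    assert(rtindex >= 1 && rtindex <= 32);
    if (node == NULL)
        return NULL;
    if (node->replaces_join &&
        (node->relids & (UINT32_C(1) << (rtindex - 1))) != 0)
        return node;
    for (int i = 0; i < TOY_MAX_CHILDREN; i++)
    {
        const ToyPlan *found = toy_find_epq_owner(node->child[i], rtindex);

        if (found != NULL)
            return found;
    }
    return NULL;
}
```

Applied to the tree above with t1..t4 as rtindex 1..4 and ft5, ft6 as 5 and 6, the search returns AliceJoin for t2 and BobJoin for ft5, matching the description.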

Is there a potential scenario where a foreign/custom join is located above
the LockRows node? (Subquery expansion may give such a case?)
Anyway, that wouldn't be a problem, would it?

As the next step, how do we implement this design?
I guess the planner needs to keep a path that contains neither
a foreign join nor a custom join with scanrelid==0.
Probably, a "cheapest_builtin_path" in RelOptInfo is needed that
never contains such remote/custom join logic, as a seed of the
alternative sub-plan.

create_foreignscan_plan() or create_customscan_plan() will be
able to construct these alternative plans, regardless of the
extension. So, individual FDWs/CSPs don't need to care about
this alternative sub-plan, do they?

After that, once ExecScanFetch() is called under EvalPlanQual(),
these Foreign/CustomScan nodes with scanrelid==0 run the alternative
sub-plan to validate the latest tuple.

Hmm... It looks like a workable approach to me.

Fujita-san, are you available to make a patch with this approach?
If so, I'd like to volunteer to review it.

[1]: /messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F20AD@BPXM15GP.gisp.nec.co.jp

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#7)

Re: Foreign join pushdown vs EvalPlanQual

Hi KaiGai-san,

On 2015/07/02 18:31, Kouhei Kaigai wrote:

Even though I'd like to hear a committer's opinion, I could not come up
with an idea better than what you proposed: a foreign/custom scan
has an alternative plan if scanrelid==0.

I'd also like to hear the opinion!

Let me introduce a few cases we should pay attention to.

Foreign/CustomScan nodes may stack; that means a Foreign/CustomScan node
may have a child node that includes another Foreign/CustomScan node with
scanrelid==0.
(At this moment, ForeignScan cannot have a child node; however, more
aggressive push-down [1] will need the same feature to fetch tuples from
a local relation and construct a VALUES() clause.)
In this case, the highest Foreign/CustomScan node (the one nearest
to LockRows or ModifyTable) runs the alternative sub-plan that includes
the scan/join plans covered by fdw_relids or custom_relids.

For example:
  LockRows
   -> HashJoin
      -> CustomScan (AliceJoin)
         -> SeqScan on t1
         -> CustomScan (CarolJoin)
            -> SeqScan on t2
            -> SeqScan on t3
      -> Hash
         -> CustomScan (BobJoin)
            -> SeqScan on t4
            -> ForeignScan (remote join involves ft5, ft6)

In this case, AliceJoin will have an alternative sub-plan to join t1, t2
and t3, which shall be used in EvalPlanQual(). Also, BobJoin will
have an alternative sub-plan to join t4, ft5 and ft6. CarolJoin and the
ForeignScan will also have alternative sub-plans; however, these are
not used in this case.
Probably, it works fine.

Yeah, I think so too.

Is there a potential scenario in which a foreign-/custom-join is located
above the LockRows node? (Subquery expansion may give such a case?)
Anyway, that wouldn't cause a problem, would it?

IIUC, I don't think we have such a case.

On the next step, how do we implement this design?
I guess that planner needs to keep a path that contains neither
foreign-join nor custom-join with scanrelid==0.
Probably, "cheapest_builtin_path" of RelOptInfo is needed that
never contains these remote/custom join logic, as a seed of
alternative sub-plan.

Yeah, I think so too, but I've not figured out how to implement this yet.

To be honest, ISTM that it's difficult to do that simply and efficiently
for the foreign/custom-join-pushdown API that we have for 9.5. It's a
little late, but what I started thinking is to redesign that API so that
that API is called at standard_join_search, as discussed in [2]; (1) to
place that API call *after* the set_cheapest call and (2) to place
another set_cheapest call after that API call for each joinrel. By the
first set_cheapest call, I think we could probably save an alternative
path that we need in "cheapest_builtin_path". By the second
set_cheapest call following that API call, we could consider
foreign/custom-join-pushdown paths also. What do you think about this idea?

create_foreignscan_plan() or create_customscan_plan() will be
able to construct these alternative plans, regardless of the
extension. So individual FDWs/CSPs don't need to care about
this alternative sub-plan, do they?

After that, once ExecScanFetch() is called under EvalPlanQual(),
these Foreign/CustomScan nodes with scanrelid==0 run the alternative
sub-plan to validate the latest tuple.

Hmm... It looks like a workable approach to me.

Yeah, I think so too.

Fujita-san, are you available to make a patch with this approach?
If so, I'd like to volunteer to review it.

Yeah, I'm willing to make a patch if we reach consensus! And I'd
be happy if you could help me with the work!

Best regards,
Etsuro Fujita

[2]: /messages/by-id/5451.1426271510@sss.pgh.pa.us


Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#8)

Re: Foreign join pushdown vs EvalPlanQual

Let me introduce a few cases we should pay attention to.

A Foreign/CustomScan node may stack; that means a Foreign/CustomScan node
may have a child node that includes another Foreign/CustomScan node with
scanrelid==0.
(At this moment, ForeignScan cannot have a child node; however, more
aggressive push-down [1] will need the same feature to fetch tuples from
a local relation and construct a VALUES() clause.)
In this case, the highest Foreign/CustomScan node (that is also nearest
to LockRows or ModifyTable) runs the alternative sub-plan that includes
the scan/join plans dominated by fdw_relids or custom_relids.

For example:
LockRows
 -> HashJoin
   -> CustomScan (AliceJoin)
     -> SeqScan on t1
     -> CustomScan (CarolJoin)
       -> SeqScan on t2
       -> SeqScan on t3
   -> Hash
     -> CustomScan (BobJoin)
       -> SeqScan on t4
       -> ForeignScan (remote join involves ft5, ft6)

In this case, AliceJoin will have alternative sub-plan to join t1, t2
and t3, then it shall be used on EvalPlanQual(). Also, BobJoin will
have alternative sub-plan to join t4, ft5 and ft6. CarolJoin and the
ForeignScan will also have alternative sub-plan, however, these are
not used in this case.
Probably, it works fine.

Yeah, I think so too.

Sorry, I need to adjust my explanation above a bit:

In this case, AliceJoin will have alternative sub-plan to join t1 and
CarolJoin, then CarolJoin will have alternative sub-plan to join t2 and
t3. Also, BobJoin will have alternative sub-plan to join t4 and the
ForeignScan with remote join, and this ForeignScan node will have
alternative sub-plan to join ft5 and ft6.

Why is this recursive design better? Because it makes the planner
enhancement much simpler than the overall approach. Please see my
explanation in the section below.

On the next step, how do we implement this design?
I guess that planner needs to keep a path that contains neither
foreign-join nor custom-join with scanrelid==0.
Probably, "cheapest_builtin_path" of RelOptInfo is needed that
never contains these remote/custom join logic, as a seed of
alternative sub-plan.

Yeah, I think so too, but I've not figured out how to implement this yet.

To be honest, ISTM that it's difficult to do that simply and efficiently
for the foreign/custom-join-pushdown API that we have for 9.5. It's a
little late, but what I started thinking is to redesign that API so that
that API is called at standard_join_search, as discussed in [2]; (1) to
place that API call *after* the set_cheapest call and (2) to place
another set_cheapest call after that API call for each joinrel. By the
first set_cheapest call, I think we could probably save an alternative
path that we need in "cheapest_builtin_path". By the second
set_cheapest call following that API call, we could consider
foreign/custom-join-pushdown paths also. What do you think about this idea?

The disadvantage is larger than the advantage, sorry.
The reason we put the foreign/custom-join hook in add_paths_to_joinrel()
is that, at other locations, the source relations (inner/outer) were not
obvious, so we could not reproduce which relations are the source of the join.
So I had to give up when I tried this approach before.

My idea is that we save the cheapest_total_path of the RelOptInfo into a
new cheapest_builtin_path just before the GetForeignJoinPaths() hook.
Why? Because of the hook location, it must be built-in join logic, never
a foreign/custom-join; only built-in logic has been added at that point.
Even if either or both of the join sub-trees contain a foreign/custom-join,
those paths have their own alternative sub-plans at their level, so there
is no need to care about them at the current level. (That is the reason
why I adjusted my explanation above.)
Once this built-in path is kept and a foreign/custom-join gets chosen by
set_cheapest(), it is easy to attach this sub-plan to the ForeignScan or
CustomScan node.
I don't see any significant downside to this approach.
What is your opinion?

Regarding the development timeline, I would prefer to put in some
workaround so as not to trip the Assert() in ExecScanFetch().
We may add a warning to the documentation not to replace a built-in
join if either or both of the sub-trees are the target of UPDATE/DELETE or
FOR SHARE/UPDATE.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#10

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#9)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/07/02 23:13, Kouhei Kaigai wrote:

To be honest, ISTM that it's difficult to do that simply and efficiently
for the foreign/custom-join-pushdown API that we have for 9.5. It's a
little late, but what I started thinking is to redesign that API so that
that API is called at standard_join_search, as discussed in [2]; (1) to
place that API call *after* the set_cheapest call and (2) to place
another set_cheapest call after that API call for each joinrel. By the
first set_cheapest call, I think we could probably save an alternative
path that we need in "cheapest_builtin_path". By the second
set_cheapest call following that API call, we could consider
foreign/custom-join-pushdown paths also. What do you think about this idea?

The disadvantage is larger than the advantage, sorry.
The reason we put the foreign/custom-join hook in add_paths_to_joinrel()
is that, at other locations, the source relations (inner/outer) were not
obvious, so we could not reproduce which relations are the source of the join.
So I had to give up when I tried this approach before.

Maybe I'm missing something, but my understanding of this approach is that
if the base relations for a given joinrel are all foreign tables and belong
to the same foreign server, then by calling that API there, we consider the
remote join over all the foreign tables, and that if not, we give up
considering the remote join.

My idea is that we save the cheapest_total_path of RelOptInfo onto the
new cheapest_builtin_path just before the GetForeignJoinPaths() hook.
Why? It should be a built-in join logic, never be a foreign/custom-join,
because of the hook location; only built-in logic shall be added here.

My concern about your idea is that since (a) add_paths_to_joinrel
is called multiple times per joinrel and (b) repeated add_path
calls through GetForeignJoinPaths in add_paths_to_joinrel might remove
old paths that are builtin, it's possible to save a path that is not
builtin into the cheapest_total_path and thus to wrongly save that path
into the cheapest_builtin_path. There might be a good way to cope with
that, though.

Regarding to the development timeline, I prefer to put something
workaround not to kick Assert() on ExecScanFetch().
We may add a warning in the documentation not to replace built-in
join if either/both of sub-trees are target of UPDATE/DELETE or
FOR SHARE/UPDATE.

I'm not sure that that is a good idea, but anyway, I think we need to
hurry to fix this issue.

Best regards,
Etsuro Fujita


#11

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#10)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/07/02 23:13, Kouhei Kaigai wrote:

To be honest, ISTM that it's difficult to do that simply and efficiently
for the foreign/custom-join-pushdown API that we have for 9.5. It's a
little late, but what I started thinking is to redesign that API so that
that API is called at standard_join_search, as discussed in [2]; (1) to
place that API call *after* the set_cheapest call and (2) to place
another set_cheapest call after that API call for each joinrel. By the
first set_cheapest call, I think we could probably save an alternative
path that we need in "cheapest_builtin_path". By the second
set_cheapest call following that API call, we could consider
foreign/custom-join-pushdown paths also. What do you think about this idea?

The disadvantage is larger than the advantage, sorry.
The reason we put the foreign/custom-join hook in add_paths_to_joinrel()
is that, at other locations, the source relations (inner/outer) were not
obvious, so we could not reproduce which relations are the source of the join.
So I had to give up when I tried this approach before.

Maybe I'm missing something, but my image about this approach is that if
base relations for a given joinrel are all foreign tables and belong to
the same foreign server, then by calling that API there, we consider the
remote join over all the foreign tables, and that if not, we give up to
consider the remote join.

Your understanding is correct, but it misses a point. Once the foreign tables
to be joined are given as a bitmap (joinrel->relids), it is not obvious
to extensions which relations are joined with INNER JOIN and which ones
are joined with OUTER JOIN.
I tried to implement a common utility function during the v9.5 cycle;
however, it was doubtful whether we could make the logic reliable.

Also, I no longer want to stick to the assumption that the relations
involved in a remote join are all managed by the same foreign server.
The following two ideas introduce possible enhancements of the remote join
feature that involve local relations, either as a replicated table or
transformed into a VALUES() clause.

/messages/by-id/CA+Tgmoai_VUF5h6qVLNLU+FKp0aeBCbnnMT3SCvL-HvOpBR=Xw@mail.gmail.com
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F20AD@BPXM15GP.gisp.nec.co.jp

Once we have to pay attention to the case of mixed local/foreign
relations, we have to care about the paths of the underlying local
relations or foreign relations managed by another foreign server.

I think add_paths_to_joinrel() is the best location for foreign-join,
not only custom-join. Relocation to standard_join_search() has a larger
disadvantage than advantage.

My idea is that we save the cheapest_total_path of RelOptInfo onto the
new cheapest_builtin_path just before the GetForeignJoinPaths() hook.
Why? It should be a built-in join logic, never be a foreign/custom-join,
because of the hook location; only built-in logic shall be added here.

My concern about your idea is that since that (a) add_paths_to_joinrel
is called multiple times per joinrel and that (b) repetitive add_path
calls through GetForeignJoinPaths in add_paths_to_joinrel might remove
old paths that are builtin, it's possible to save a path that is not
builtin onto the cheapest_total_path and thus to save that path wrongly
onto the cheapest_builtin_path. There might be a good way to cope with
that, though.

For concern (a), the FDW driver can reference RelOptInfo->fdw_private,
which is initialized to NULL; the FDW driver then sets valid data there
once it has added something. IIRC, postgres_fdw also avoids
adding the same path multiple times.
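
A minimal sketch of the guard described for concern (a), written in plain C
with toy stand-in types rather than the real planner structures.
ToyRelOptInfo, ToyFdwJoinState and toy_get_foreign_join_paths() are
hypothetical names used only for illustration; the only part taken from the
discussion is the idea of testing fdw_private for NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the planner's RelOptInfo; only the fields needed here. */
typedef struct ToyRelOptInfo
{
    void       *fdw_private;    /* NULL until the FDW has handled this joinrel */
    int         paths_added;    /* how many foreign-join paths were built */
} ToyRelOptInfo;

/* Hypothetical per-joinrel state an FDW might hang off fdw_private. */
typedef struct ToyFdwJoinState
{
    int         visited;
} ToyFdwJoinState;

/*
 * Hypothetical hook body.  The hook may fire once per (outer, inner) pair,
 * i.e. several times for the same joinrel.  Returns 1 if a path was built
 * on this call, 0 if the joinrel had already been handled.
 */
int
toy_get_foreign_join_paths(ToyRelOptInfo *joinrel, ToyFdwJoinState *state)
{
    assert(joinrel != NULL && state != NULL);

    if (joinrel->fdw_private != NULL)
        return 0;               /* already handled this joinrel; skip */

    state->visited = 1;
    joinrel->fdw_private = state;
    joinrel->paths_added++;
    return 1;
}
```

Calling the function three times on the same ToyRelOptInfo builds a path
only on the first call.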

For concern (b), yep, we may enhance add_path(), instead of
add_paths_to_joinrel(), to retain the built-in path.
Let's adjust the logic a bit. add_path() can know whether the given
path is a usual one or an exceptional one (a ForeignPath/CustomPath for a
non-base relation). If the path is exceptional, the cheapest_builtin_path
shall be retained unconditionally. Otherwise, the cheaper one replaces it,
so the cheapest built-in path will survive.
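
The rule for concern (b) could look roughly like the following sketch. It is
a simplification under stated assumptions: the real add_path() maintains a
pathlist and compares more than total cost, while this toy version tracks
only two pointers. ToyPath, ToyRel and toy_add_path() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyPath
{
    double      total_cost;
    int         is_pushdown_join;   /* 1 = ForeignPath/CustomPath for a join */
} ToyPath;

typedef struct ToyRel
{
    const ToyPath *cheapest_total_path;     /* cheapest of all paths */
    const ToyPath *cheapest_builtin_path;   /* cheapest built-in path only */
} ToyRel;

/*
 * Toy add_path(): an "exceptional" (pushed-down join) path never touches
 * cheapest_builtin_path; a built-in path replaces it only when cheaper.
 */
void
toy_add_path(ToyRel *rel, const ToyPath *path)
{
    assert(rel != NULL && path != NULL);

    if (!path->is_pushdown_join &&
        (rel->cheapest_builtin_path == NULL ||
         path->total_cost < rel->cheapest_builtin_path->total_cost))
        rel->cheapest_builtin_path = path;

    if (rel->cheapest_total_path == NULL ||
        path->total_cost < rel->cheapest_total_path->total_cost)
        rel->cheapest_total_path = path;
}
```

With a built-in nested loop (cost 100), a foreign join (cost 10) and a
built-in hash join (cost 50) added in that order, the foreign join ends up as
cheapest_total_path while the hash join survives as cheapest_builtin_path.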

Is it still problematic?

Regarding to the development timeline, I prefer to put something
workaround not to kick Assert() on ExecScanFetch().
We may add a warning in the documentation not to replace built-in
join if either/both of sub-trees are target of UPDATE/DELETE or
FOR SHARE/UPDATE.

I'm not sure that that is a good idea, but anyway, I think we need to
hurry fixing this issue.

My approach is not to fix, but to avoid. :-)

It may be an idea to implement the above fixup, even though it may be
too large/late to apply to v9.5, so that we can understand how many
changes are needed to fix this problem.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#12

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#11)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/07/03 15:32, Kouhei Kaigai wrote:

On 2015/07/02 23:13, Kouhei Kaigai wrote:

To be honest, ISTM that it's difficult to do that simply and efficiently
for the foreign/custom-join-pushdown API that we have for 9.5. It's a
little late, but what I started thinking is to redesign that API so that
that API is called at standard_join_search, as discussed in [2];

The disadvantage is larger than the advantage, sorry.
The reason we put the foreign/custom-join hook in add_paths_to_joinrel()
is that, at other locations, the source relations (inner/outer) were not
obvious, so we could not reproduce which relations are the source of the join.
So I had to give up when I tried this approach before.

Maybe I'm missing something, but my image about this approach is that if
base relations for a given joinrel are all foreign tables and belong to
the same foreign server, then by calling that API there, we consider the
remote join over all the foreign tables, and that if not, we give up to
consider the remote join.

Your understanding is correct, but missing a point. Once foreign tables
to be joined are informed as a bitmap (joinrel->relids), it is not obvious
for extensions which relations are joined with INNER JOIN, and which ones
are joined with OUTER JOIN.

Can't FDWs get the join information through the root, which I think we
would pass to the API as the argument?

Also, I don't want to stick on the assumption that relations involved in
remote join are all managed by same foreign-server no longer.
The following two ideas introduce possible enhancement of remote join
feature that involved local relations; replicated table or transformed
to VALUES() clause.

/messages/by-id/CA+Tgmoai_VUF5h6qVLNLU+FKp0aeBCbnnMT3SCvL-HvOpBR=Xw@mail.gmail.com
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F20AD@BPXM15GP.gisp.nec.co.jp

Interesting!

I think add_paths_to_joinrel() is the best location for foreign-join,
not only custom-join. Relocation to standard_join_search() has larger
disadvantage than its advantage.

I agree with you that it's important to ensure the expandability, and my
question is, is it possible that the API call from standard_join_search
could also realize those ideas if FDWs can get the join information through
the root or something like that?

Best regards,
Etsuro Fujita


#13

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#12)

Re: Foreign join pushdown vs EvalPlanQual

Also, I don't want to stick on the assumption that relations involved in
remote join are all managed by same foreign-server no longer.
The following two ideas introduce possible enhancement of remote join
feature that involved local relations; replicated table or transformed
to VALUES() clause.

/messages/by-id/CA+Tgmoai_VUF5h6qVLNLU+FKp0aeBCbnnMT3SCvL-HvOpBR=Xw@mail.gmail.com
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F20AD@BPXM15GP.gisp.nec.co.jp

Interesting!

I think add_paths_to_joinrel() is the best location for foreign-join,
not only custom-join. Relocation to standard_join_search() has larger
disadvantage than its advantage.

I agree with you that it's important to ensure the expandability, and my
question is, is it possible that the API call from standard_join_search
also realize those idea if FDWs can get the join information through the
root or something like that?

I don't deny the possibility, even though I once gave up trying to
reproduce the join information - which relations are joined with which
at this level - using PlannerInfo and RelOptInfo.
However, we need to pay attention to the advantages over the alternatives.
Hooks in add_paths_to_joinrel() enable us to implement the identical stuff
with less complicated logic than reproducing the left / right relations
from the RelOptInfo of the joinrel. (Note that RelOptInfo->fdw_private
makes it possible to avoid constructing paths multiple times.)
I'm uncertain why this API change is necessary to fix the problem
around EvalPlanQual.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#14

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#13)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/07/06 9:42, Kouhei Kaigai wrote:

Also, I don't want to stick on the assumption that relations involved in
remote join are all managed by same foreign-server no longer.
The following two ideas introduce possible enhancement of remote join
feature that involved local relations; replicated table or transformed
to VALUES() clause.

I think add_paths_to_joinrel() is the best location for foreign-join,
not only custom-join. Relocation to standard_join_search() has larger
disadvantage than its advantage.

I agree with you that it's important to ensure the expandability, and my
question is, is it possible that the API call from standard_join_search
also realize those idea if FDWs can get the join information through the
root or something like that?

I don't deny its possibility, even though I once gave up to implement to
reproduce join information - which relations and other ones are joined in
this level - using PlannerInfo and RelOptInfo.

However, we need to pay attention on advantages towards the alternatives.
Hooks on add_paths_to_joinrel() enables to implement identical stuff, with
less complicated logic to reproduce left / right relations from RelOptInfo
of the joinrel. (Note that RelOptInfo->fdw_private enables to avoid path-
construction multiple times.)
I'm uncertain why this API change is necessary to fix up the problem
around EvalPlanQual.

Yeah, maybe we wouldn't need any API change. I think we would be able
to fix this by complicating add_path as you pointed out upthread. I'm
not sure that complicating it is a good idea, though. I think that it
might be possible that the callback in standard_join_search would allow
us to fix this without complicating the core path-cost-comparison stuff
such as add_path. I noticed that what I proposed upthread doesn't work
properly, though.

Actually, I have another concern about the callback location that you
proposed; that might meaninglessly increase planning time in the
postgres_fdw case when using remote estimates, which the proposed
postgres_fdw patch doesn't support currently IIUC, but I think it should
support that. Let me explain about that. If you have A JOIN B JOIN C
all on the same foreign server, for example, we'll have only to perform
a remote EXPLAIN for A-B-C for the estimates (when adopting a strategy
that we push down a join as large as possible into the remote server).
However, ISTM that the callback in add_paths_to_joinrel would perform
remote EXPLAINs not only for A-B-C but for A-B, A-C and B-C according to
the dynamic programming algorithm. (Duplicated remote EXPLAINs for
A-B-C can be eliminated using a way you proposed.) Thus the remote
EXPLAINs for A-B, A-C and B-C seem to me meaningless while incurring
performance degradation in query planning. Maybe I'm missing something,
though.
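
To put rough numbers on this, here is a small sketch that counts the joinrels
the dynamic programming search would build for n foreign tables. It assumes,
as a simplification, that every subset of two or more tables forms a legal
joinrel; toy_count_joinrels() is a hypothetical helper, not planner code.

```c
#include <assert.h>

/*
 * Count the subsets of n base relations that contain at least two members.
 * Under the stated assumption, each such subset is one joinrel, and a hook
 * using remote estimates would issue one remote EXPLAIN per joinrel.
 */
unsigned long
toy_count_joinrels(unsigned n)
{
    unsigned long joinrels = 0;

    assert(n < 16);             /* keep the enumeration small */

    for (unsigned long set = 1; set < (1UL << n); set++)
    {
        unsigned    members = 0;

        /* count the set bits, i.e. the base relations in this subset */
        for (unsigned long bits = set; bits != 0; bits &= bits - 1)
            members++;

        if (members >= 2)
            joinrels++;
    }
    return joinrels;
}
```

For n = 3 this gives 4 (A-B, A-C, B-C and A-B-C), so three of the four remote
EXPLAINs would be for joins smaller than the one that is finally pushed down.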

Best regards,
Etsuro Fujita


#15

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#14)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/07/07 19:15, Etsuro Fujita wrote:

On 2015/07/06 9:42, Kouhei Kaigai wrote:

However, we need to pay attention on advantages towards the alternatives.
Hooks on add_paths_to_joinrel() enables to implement identical stuff, with
less complicated logic to reproduce left / right relations from RelOptInfo
of the joinrel. (Note that RelOptInfo->fdw_private enables to avoid path-
construction multiple times.)
I'm uncertain why this API change is necessary to fix up the problem
around EvalPlanQual.

Yeah, maybe we wouldn't need any API change. I think we would be able
to fix this by complicating add_path as you pointed out upthread. I'm
not sure that complicating it is a good idea, though. I think that it
might be possible that the callback in standard_join_search would allow
us to fix this without complicating the core path-cost-comparison stuff
such as add_path. I noticed that what I proposed upthread doesn't work
properly, though.

To resolve this issue, I tried to make the core create an alternative
plan that will be used in an EvalPlanQual recheck, instead of a foreign
scan that performs a foreign join remotely (ie, scanrelid = 0). But I
changed that idea. Instead, I'd like to propose that it's the FDW's
responsibility to provide such a plan. Specifically, I'd propose that
(1) we add a new Path field, say subpath, to the ForeignPath data
structure and that (2) when generating a ForeignPath node for a foreign
join, an FDW must provide the subpath Path node by itself. As before,
it'd be recommended to use

ForeignPath *
create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
                        double rows, Cost startup_cost, Cost total_cost,
                        List *pathkeys,
                        Relids required_outer,
                        Path *subpath,
                        List *fdw_private)

where subpath is the subpath Path node that has the pathkeys and the
required_outer rels. (subpath is NULL if scanning a base relation.)
Also, it'd be recommended that an FDW generates such ForeignPath nodes
by considering, for each of paths in the rel's pathlist, whether to push
down that path (to generate a ForeignPath node for a foreign join that
has the same pathkeys and parameterization as that path). So, if
generating the ForeignPath node, that path could be used as the subpath
Path node.
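
As an illustration only, the relationship between a pushed-down join path and
its subpath might be sketched like this. The types are toy stand-ins, and
ToyPath, toy_make_foreign_join_path() and toy_epq_plan_source() are
hypothetical names; pathkeys and parameterization are reduced to plain
integers.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyPath
{
    int         pathkeys_id;        /* stand-in for the pathkeys list */
    int         required_outer;     /* stand-in for the parameterization */
    double      total_cost;
    const struct ToyPath *subpath;  /* local path for EPQ; NULL for base rels */
    int         is_foreign_join;
} ToyPath;

/*
 * Build a foreign-join path that mirrors the ordering and parameterization
 * of an existing local path and keeps that local path as its subpath.
 */
ToyPath
toy_make_foreign_join_path(const ToyPath *local, double remote_cost)
{
    ToyPath     fp;

    assert(local != NULL);
    fp.pathkeys_id = local->pathkeys_id;
    fp.required_outer = local->required_outer;
    fp.total_cost = remote_cost;
    fp.subpath = local;
    fp.is_foreign_join = 1;
    return fp;
}

/* Pick the path whose plan would be run during an EvalPlanQual recheck. */
const ToyPath *
toy_epq_plan_source(const ToyPath *path)
{
    assert(path != NULL);
    if (path->is_foreign_join && path->subpath != NULL)
        return path->subpath;
    return path;
}
```

The point of the sketch is that the foreign-join path carries the same
ordering and parameterization as the local path it replaces, so the local
path can stand in for it during a recheck.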

(I think the current postgres_fdw patch only considers an unsorted,
unparameterized path for performing a foreign join remotely, but I think
we should also consider presorted and/or parameterized paths.)

I think this idea would apply to the API location that you proposed.
However, ISTM that this idea would work better for the API call from
standard_join_search because the rel's pathlist at that point has more
paths worthy of consideration, in view of not only costs and sizes but
pathkeys and parameterization.

I think the subplan created from the subpath Path node could be used in
an EvalPlanQual recheck, instead of a foreign scan that performs a
foreign join remotely, as discussed previously.

Comments welcome!

Best regards,
Etsuro Fujita


#16

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#12)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Jul 3, 2015 at 6:25 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

Can't FDWs get the join information through the root, which I think we would
pass to the API as the argument?

This is exactly what Tom suggested originally, and it has some appeal,
but neither KaiGai nor I could see how to make it work. Do you have
an idea? It's not too late to go back and change the API.

The problem that was bothering us (or at least what was bothering me)
is that the PlannerInfo provides only a list of SpecialJoinInfo
structures, which don't directly give you the original join order. In
fact, min_righthand and min_lefthand are intended to constrain the
*possible* join orders, and are deliberately designed *not* to specify
a single join order. If you're sending a query to a remote PostgreSQL
node, you don't want to know what all the possible join orders are;
it's the remote side's job to plan the query. You do, however, need
an easy way to identify one join order that you can use to construct a
query. It didn't seem easy to do that without duplicating
make_join_rel(), which seemed like a bad idea.

But maybe there's a good way to do it. Tom wasn't crazy about this
hook both because of the frequency of calls and also because of the
long argument list. I think those concerns are legitimate; I just
couldn't see how to make the other way work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#17

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Robert Haas (#16)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Jul 3, 2015 at 6:25 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

Can't FDWs get the join information through the root, which I think we would
pass to the API as the argument?

This is exactly what Tom suggested originally, and it has some appeal,
but neither KaiGai nor I could see how to make it work. Do you have
an idea? It's not too late to go back and change the API.

The problem that was bothering us (or at least what was bothering me)
is that the PlannerInfo provides only a list of SpecialJoinInfo
structures, which don't directly give you the original join order. In
fact, min_righthand and min_lefthand are intended to constrain the
*possible* join orders, and are deliberately designed *not* to specify
a single join order. If you're sending a query to a remote PostgreSQL
node, you don't want to know what all the possible join orders are;
it's the remote side's job to plan the query. You do, however, need
an easy way to identify one join order that you can use to construct a
query. It didn't seem easy to do that without duplicating
make_join_rel(), which seemed like a bad idea.

But maybe there's a good way to do it. Tom wasn't crazy about this
hook both because of the frequency of calls and also because of the
long argument list. I think those concerns are legitimate; I just
couldn't see how to make the other way work.

I had a chance to discuss this topic with Fujita-san.
He has an idea for tackling this matter that is a little bit tricky, but I
had no clear reason to reject it.
At the line just above set_cheapest() in standard_join_search(),
at least one path by built-in join logic has already been added to the
RelOptInfo, so the FDW driver can reference the cheapest path by built-in
logic and its lefttree and righttree that construct the joinrel.
The assumption is that the best path by built-in logic has a join order
at least reasonable enough compared with other potential ones.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#18

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Robert Haas (#16)

Re: Foreign join pushdown vs EvalPlanQual

Robert Haas <robertmhaas@gmail.com> writes:

The problem that was bothering us (or at least what was bothering me)
is that the PlannerInfo provides only a list of SpecialJoinInfo
structures, which don't directly give you the original join order. In
fact, min_righthand and min_lefthand are intended to constrain the
*possible* join orders, and are deliberately designed *not* to specify
a single join order. If you're sending a query to a remote PostgreSQL
node, you don't want to know what all the possible join orders are;
it's the remote side's job to plan the query. You do, however, need
an easy way to identify one join order that you can use to construct a
query. It didn't seem easy to do that without duplicating
make_join_rel(), which seemed like a bad idea.

In principle it seems like you could traverse root->parse->jointree
as a guide to reconstructing the original syntactic structure; though
I'm not sure how hard it would be to ignore the parts of that tree
that correspond to relations you're not shipping.

But maybe there's a good way to do it. Tom wasn't crazy about this
hook both because of the frequency of calls and also because of the
long argument list. I think those concerns are legitimate; I just
couldn't see how to make the other way work.

In my vision you probably really only want one call per build_join_rel
event (that is, per construction of a new RelOptInfo), not per
make_join_rel event.

It's possible that an FDW that wants to handle joins but is not talking to
a remote query planner would need to grovel through all the join ordering
possibilities individually, and then maybe hooking at make_join_rel is
sensible rather than having to reinvent that logic. But I'd want to see a
concrete use-case first, and I certainly don't think that that's the main
case to design the API around.
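
The difference in call frequency can be sketched by counting, for n
relations, how many (left, right) splits the join search would consider.
This assumes every split of every subset is legal, which overstates what the
planner does when join-order restrictions apply; toy_count_join_pairs() is a
hypothetical helper.

```c
#include <assert.h>

/*
 * Count unordered (left, right) splits over all subsets of n relations.
 * Each split stands for one make_join_rel event, whereas each subset with
 * two or more members stands for a single build_join_rel event.
 */
unsigned long
toy_count_join_pairs(unsigned n)
{
    unsigned long pairs = 0;
    unsigned long all;

    assert(n < 16);             /* keep the enumeration small */
    all = (1UL << n) - 1;

    for (unsigned long rel = 1; rel <= all; rel++)
    {
        /* walk every non-empty proper subset of rel */
        for (unsigned long left = (rel - 1) & rel;
             left != 0;
             left = (left - 1) & rel)
        {
            unsigned long right = rel & ~left;

            if (left < right)   /* count each unordered split once */
                pairs++;
        }
    }
    return pairs;
}
```

For n = 3 this yields 6 splits against 4 joinrels, and for n = 4 it yields
25 splits against 11 joinrels, so a hook per make_join_rel event fires
noticeably more often than one per new RelOptInfo.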

regards, tom lane


#19

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Kouhei Kaigai (#17)

Re: Foreign join pushdown vs EvalPlanQual

I had a discussion with Fujita-san about this topic.

Let me also share where that discussion points for a complete solution.

The root cause of this problem is that a Scan node with scanrelid==0
represents a join that can involve multiple relations, so the TupleDesc
of its records does not match any base relation; however,
ExecScanFetch() was not updated when scanrelid==0 became supported.

The FDW/CSP behind a Scan node with scanrelid==0 is responsible for
generating records according to the fdw_/custom_scan_tlist that
reflects the definition of the join, and only the FDW/CSP knows how
to combine the base relations.
In addition, host-side expressions (like Plan->qual) are initialized
to reference the records generated by the FDW/CSP, so I think the least
invasive approach is to let the FDW/CSP supply its own recheck logic.

Below is the structure of ExecScanFetch().

ExecScanFetch(ScanState *node,
              ExecScanAccessMtd accessMtd,
              ExecScanRecheckMtd recheckMtd)
{
    EState *estate = node->ps.state;

    if (estate->es_epqTuple != NULL)
    {
        /*
         * We are inside an EvalPlanQual recheck. Return the test tuple if
         * one is available, after rechecking any access-method-specific
         * conditions.
         */
        Index scanrelid = ((Scan *) node->ps.plan)->scanrelid;

        Assert(scanrelid > 0);
        if (estate->es_epqTupleSet[scanrelid - 1])
        {
            TupleTableSlot *slot = node->ss_ScanTupleSlot;
            :
            return slot;
        }
    }
    return (*accessMtd) (node);
}

Inside EPQ, it fetches a tuple from the es_epqTuple[] array and
checks its visibility (ForeignRecheck() always says 'yep, it is visible'),
then ExecScan() applies the node's qualifiers with ExecQual().
So, as long as the FDW/CSP can return a record that matches the TupleDesc
of this relation, built from the tuples in the es_epqTuple[] array, the
rest of the code path is common.

I have an idea to solve the problem.
It adds a recheckMtd() call for scanrelid==0 just before the assertion above,
and adds an FDW callback invoked from ForeignRecheck().
The role of this new callback is to set up the supplied TupleTableSlot
and check its visibility; the core does not define how to do this.
That is left to the FDW driver, which could, for example, invoke an
alternative plan that consists only of built-in logic.
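
As a rough illustration of this idea, here is a minimal, self-contained model of the fetch logic with the extra scanrelid==0 path. It is a sketch under stated assumptions, not PostgreSQL code: Slot, ScanNode, scan_fetch(), join_recheck(), normal_access() and demo_fetch() are hypothetical stand-ins, tuples are reduced to single integers, and the provider callback joins the two test tuples on equality.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; NOT the real PostgreSQL structs. */
typedef struct Slot
{
    int         value;
} Slot;

typedef struct ScanNode ScanNode;
typedef Slot *(*AccessMtd) (ScanNode *node);
typedef bool (*RecheckMtd) (ScanNode *node, Slot *slot);

struct ScanNode
{
    unsigned    scanrelid;      /* 0 means the scan stands for a join */
    bool        in_epq;         /* models estate->es_epqTuple != NULL */
    const bool *epq_tuple_set;  /* models es_epqTupleSet[] */
    const int  *epq_tuple;      /* models es_epqTuple[] */
    Slot        slot;           /* models ss_ScanTupleSlot */
};

Slot *
scan_fetch(ScanNode *node, AccessMtd access, RecheckMtd recheck)
{
    if (node->in_epq)
    {
        if (node->scanrelid == 0)
        {
            /* Proposed path: the provider fills the slot from the test tuples. */
            return recheck(node, &node->slot) ? &node->slot : NULL;
        }
        assert(node->scanrelid > 0);
        if (node->epq_tuple_set[node->scanrelid - 1])
        {
            /* Existing path: hand back the base relation's test tuple. */
            node->slot.value = node->epq_tuple[node->scanrelid - 1];
            return recheck(node, &node->slot) ? &node->slot : NULL;
        }
    }
    return access(node);
}

/* Hypothetical provider callback: joins the test tuples of rels 1 and 2. */
bool
join_recheck(ScanNode *node, Slot *slot)
{
    if (node->scanrelid != 0)
        return true;            /* base relation: nothing extra to recheck */
    if (node->epq_tuple[0] != node->epq_tuple[1])
        return false;           /* test tuples do not join */
    slot->value = node->epq_tuple[0];
    return true;
}

/* Hypothetical normal access method; stores a marker value. */
Slot *
normal_access(ScanNode *node)
{
    node->slot.value = -1;
    return &node->slot;
}

/* Demo helper: returns the fetched value, or -2 when no row comes back. */
int
demo_fetch(unsigned scanrelid, bool in_epq, int t1, int t2)
{
    const bool  set[2] = {true, true};
    const int   tuples[2] = {t1, t2};
    ScanNode    node = {scanrelid, in_epq, set, tuples, {0}};
    Slot       *slot = scan_fetch(&node, normal_access, join_recheck);

    return slot ? slot->value : -2;
}
```

In this model the core never needs to know how the join row is built; it only asks the callback to fill the slot and report whether a row exists.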

Invoking an alternative plan is one of the most feasible ways to
implement EPQ logic in an FDW, so I think FDWs also need a mechanism
that takes child path nodes, like custom_paths in the CustomPath node.
Once a valid path node is linked to this list, createplan.c transforms
it into the corresponding plan node, and the FDW can then initialize and
invoke this plan node during execution, for example from ForeignRecheck().

This design can also solve another problem Fujita-san has mentioned.
If a scan qualifier is pushed down to the remote query and its expression
node is saved in the private area of ForeignScan, the callback in
ForeignRecheck() can evaluate the qualifier by itself. (Note that only
the FDW driver knows where and how the pushed-down expression node
is saved in the private area.)

In summary, the following three enhancements are a straightforward
way to fix the problem he reported.
1. Add a special path in ExecScanFetch that calls recheckMtd if scanrelid==0.
2. Add an FDW callback in ForeignRecheck() to construct a record
according to the fdw_scan_tlist definition and evaluate its
visibility, or to evaluate pushed-down qualifiers for a base relation.
3. Add List *fdw_paths to ForeignPath, like custom_paths in CustomPath,
to construct plan nodes for EPQ evaluation.

On the other hand, we also need to pay attention to the development
timeline. This really is a v9.5 problem; however, it looks to me like
the straightforward solution needs an enhancement of the FDW APIs.

I'd like to hear people's comments.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
Sent: Saturday, August 01, 2015 10:35 PM
To: Robert Haas; Etsuro Fujita
Cc: PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On Fri, Jul 3, 2015 at 6:25 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

Can't FDWs get the join information through the root, which I think we would
pass to the API as the argument?

This is exactly what Tom suggested originally, and it has some appeal,
but neither KaiGai nor I could see how to make it work. Do you have
an idea? It's not too late to go back and change the API.

The problem that was bothering us (or at least what was bothering me)
is that the PlannerInfo provides only a list of SpecialJoinInfo
structures, which don't directly give you the original join order. In
fact, min_righthand and min_lefthand are intended to constrain the
*possible* join orders, and are deliberately designed *not* to specify
a single join order. If you're sending a query to a remote PostgreSQL
node, you don't want to know what all the possible join orders are;
it's the remote side's job to plan the query. You do, however, need
an easy way to identify one join order that you can use to construct a
query. It didn't seem easy to do that without duplicating
make_join_rel(), which seemed like a bad idea.

But maybe there's a good way to do it. Tom wasn't crazy about this
hook both because of the frequency of calls and also because of the
long argument list. I think those concerns are legitimate; I just
couldn't see how to make the other way work.

I had a discussion with Fujita-san about this topic.
He has an idea for tackling this matter that is a little bit tricky,
but I have no clear reason to reject it.
At the line just above set_cheapest() in standard_join_search(),
at least one built-in join path has already been added to the RelOptInfo,
so the FDW driver can reference the cheapest path built by the built-in logic
and the lefttree and righttree that construct the joinrel.
The assumption is that the best path found by the built-in logic gives a
join order at least as reasonable as the other potential ones.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#20

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Kouhei Kaigai (#19)

Re: Foreign join pushdown vs EvalPlanQual

I'm not an expert in this area, but this plan does not seem unreasonable to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#20)

2 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/08/12 7:21, Robert Haas wrote:

I'm not an expert in this area, but this plan does not seem unreasonable to me.

As I recall from the discussion with KaiGai-san, I think that would work. I
think it would be more suitable for CSPs, though. Correct me if
I'm wrong, KaiGai-san. In either case, I'm not sure it is a good idea to
hand both of the following to a single callback routine hooked in
ForeignRecheck: (a) checking whether the test tuple for
each component foreign table satisfies the remote qual condition, and (b)
checking whether those tuples satisfy the remote join condition. I think
that would be too complicated, probably making the callback routine
bug-prone. So, I'd still propose that *the core* process (a) and (b)
*separately*.

* As for (a), the core checks the remote qual condition as in [1].

* As for (b), the core executes an alternative subplan locally if inside
an EPQ recheck. The subplan is created as described in [2].
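
The split between (a) and (b) can be sketched with a toy model. This is only an illustration under stated assumptions: ForeignScanModel, recheck_remote_quals(), epq_fetch() and the demo helpers are hypothetical names, tuples are plain integers, a remote qual is a predicate function, and the alternative subplan is reduced to a single local join function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*QualFn) (int tuple);     /* models one remote qual */
typedef bool (*LocalJoinFn) (const int *test_tuples, int *result);

/* Simplified stand-in; NOT the real ForeignScan node. */
typedef struct ForeignScanModel
{
    unsigned    scanrelid;      /* 0 for a pushed-down join */
    const QualFn *remote_quals; /* models the remote quals of a base relation */
    size_t      n_remote_quals;
    LocalJoinFn local_join;     /* models the local alternative subplan */
} ForeignScanModel;

/* (a) The core rechecks the remote quals against a base relation's test tuple. */
bool
recheck_remote_quals(const ForeignScanModel *scan, int tuple)
{
    for (size_t i = 0; i < scan->n_remote_quals; i++)
    {
        if (!scan->remote_quals[i](tuple))
            return false;
    }
    return true;
}

/* (b) For a join inside EPQ, the core runs the local alternative instead. */
bool
epq_fetch(const ForeignScanModel *scan, const int *test_tuples, int *result)
{
    if (scan->scanrelid == 0)
    {
        assert(scan->local_join != NULL);
        return scan->local_join(test_tuples, result);
    }
    if (!recheck_remote_quals(scan, test_tuples[scan->scanrelid - 1]))
        return false;
    *result = test_tuples[scan->scanrelid - 1];
    return true;
}

/* Hypothetical remote qual: "a > 0". */
bool
is_positive(int tuple)
{
    return tuple > 0;
}

/* Hypothetical local join of rels 1 and 2 on equality. */
bool
equi_join(const int *test_tuples, int *result)
{
    if (test_tuples[0] != test_tuples[1])
        return false;
    *result = test_tuples[0];
    return true;
}

/* Demo helper: returns the rechecked value, or -2 when the recheck fails. */
int
demo_epq(unsigned scanrelid, int t1, int t2)
{
    static const QualFn quals[] = {is_positive};
    const int   tuples[2] = {t1, t2};
    ForeignScanModel scan = {scanrelid, quals, 1, equi_join};
    int         result = 0;

    return epq_fetch(&scan, tuples, &result) ? result : -2;
}
```

Keeping the two checks in separate core-driven paths means neither path has to reason about the other's conditions, which is the point of proposing them separately.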

Attached is a WIP patch for that against 9.5
(fdw-eval-plan-qual-0.1.patch), which includes an updated version of the
patch in [1]. I haven't done anything about custom joins yet. Also, I
left the join pushdown API as-is, but I still think it would be
better to hook that API in standard_join_search, so I plan to
modify the patch accordingly in the next version.

For tests, I did a very basic update of the latest postgres_fdw patch in
[3]. Apply the patches in the following order:

fdw-eval-plan-qual-0.1.patch
usermapping_matching.patch (in [3])
add_GetUserMappingById.patch (in [3])
foreign_join_v16_efujita.patch

(Note that this setup does not let you test [1]. For that, apply
fdw-eval-plan-qual-0.1.patch and the postgres_fdw patch in [1], in this
order.)

Comments welcome!

Best regards,
Etsuro Fujita

[1]: /messages/by-id/55B204A0.1080507@lab.ntt.co.jp
[2]: /messages/by-id/55B9F95F.5060506@lab.ntt.co.jp
[3]: /messages/by-id/CAEZqfEe9KGy=1_waGh2rgZPg0o4pqgD+iauYaj8wTze+CYJUHg@mail.gmail.com

Attachments:

fdw-eval-plan-qual-0.1.patch (text/x-patch)

*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***************
*** 525,530 **** fileGetForeignPaths(PlannerInfo *root,
--- 525,531 ----
  									 total_cost,
  									 NIL,		/* no pathkeys */
  									 NULL,		/* no outer rel either */
+ 									 NULL,		/* no alternative path */
  									 coptions));
  
  	/*
***************
*** 563,569 **** fileGetForeignPlan(PlannerInfo *root,
  							scan_relid,
  							NIL,	/* no expressions to evaluate */
  							best_path->fdw_private,
! 							NIL /* no custom tlist */ );
  }
  
  /*
--- 564,571 ----
  							scan_relid,
  							NIL,	/* no expressions to evaluate */
  							best_path->fdw_private,
! 							NIL,	/* no custom tlist */
! 							NIL /* no remote quals */ );
  }
  
  /*
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***************
*** 560,565 **** postgresGetForeignPaths(PlannerInfo *root,
--- 560,566 ----
  								   fpinfo->total_cost,
  								   NIL, /* no pathkeys */
  								   NULL,		/* no outer rel either */
+ 								   NULL,		/* no alternative path */
  								   NIL);		/* no fdw_private list */
  	add_path(baserel, (Path *) path);
  
***************
*** 727,732 **** postgresGetForeignPaths(PlannerInfo *root,
--- 728,734 ----
  									   total_cost,
  									   NIL,		/* no pathkeys */
  									   param_info->ppi_req_outer,
+ 									   NULL,	/* no alternative path */
  									   NIL);	/* no fdw_private list */
  		add_path(baserel, (Path *) path);
  	}
***************
*** 748,753 **** postgresGetForeignPlan(PlannerInfo *root,
--- 750,756 ----
  	Index		scan_relid = baserel->relid;
  	List	   *fdw_private;
  	List	   *remote_conds = NIL;
+ 	List	   *remote_exprs = NIL;
  	List	   *local_exprs = NIL;
  	List	   *params_list = NIL;
  	List	   *retrieved_attrs;
***************
*** 769,776 **** postgresGetForeignPlan(PlannerInfo *root,
  	 *
  	 * This code must match "extract_actual_clauses(scan_clauses, false)"
  	 * except for the additional decision about remote versus local execution.
! 	 * Note however that we only strip the RestrictInfo nodes from the
! 	 * local_exprs list, since appendWhereClause expects a list of
  	 * RestrictInfos.
  	 */
  	foreach(lc, scan_clauses)
--- 772,779 ----
  	 *
  	 * This code must match "extract_actual_clauses(scan_clauses, false)"
  	 * except for the additional decision about remote versus local execution.
! 	 * Note however that we don't strip the RestrictInfo nodes from the
! 	 * remote_conds list, since appendWhereClause expects a list of
  	 * RestrictInfos.
  	 */
  	foreach(lc, scan_clauses)
***************
*** 784,794 **** postgresGetForeignPlan(PlannerInfo *root,
--- 787,803 ----
  			continue;
  
  		if (list_member_ptr(fpinfo->remote_conds, rinfo))
+ 		{
  			remote_conds = lappend(remote_conds, rinfo);
+ 			remote_exprs = lappend(remote_exprs, rinfo->clause);
+ 		}
  		else if (list_member_ptr(fpinfo->local_conds, rinfo))
  			local_exprs = lappend(local_exprs, rinfo->clause);
  		else if (is_foreign_expr(root, baserel, rinfo->clause))
+ 		{
  			remote_conds = lappend(remote_conds, rinfo);
+ 			remote_exprs = lappend(remote_exprs, rinfo->clause);
+ 		}
  		else
  			local_exprs = lappend(local_exprs, rinfo->clause);
  	}
***************
*** 874,880 **** postgresGetForeignPlan(PlannerInfo *root,
  							scan_relid,
  							params_list,
  							fdw_private,
! 							NIL /* no custom tlist */ );
  }
  
  /*
--- 883,890 ----
  							scan_relid,
  							params_list,
  							fdw_private,
! 							NIL,	/* no custom tlist */
! 							remote_exprs);
  }
  
  /*
*** a/src/backend/executor/nodeForeignscan.c
--- b/src/backend/executor/nodeForeignscan.c
***************
*** 72,79 **** ForeignNext(ForeignScanState *node)
  static bool
  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  {
! 	/* There are no access-method-specific conditions to recheck. */
! 	return true;
  }
  
  /* ----------------------------------------------------------------
--- 72,90 ----
  static bool
  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  {
! 	ExprContext *econtext;
! 
! 	/*
! 	 * extract necessary information from foreign scan node
! 	 */
! 	econtext = node->ss.ps.ps_ExprContext;
! 
! 	/* Does the tuple meet the remotequals condition? */
! 	econtext->ecxt_scantuple = slot;
! 
! 	ResetExprContext(econtext);
! 
! 	return ExecQual(node->fss_remotequals, econtext, false);
  }
  
  /* ----------------------------------------------------------------
***************
*** 88,93 **** ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
--- 99,122 ----
  TupleTableSlot *
  ExecForeignScan(ForeignScanState *node)
  {
+ 	EState	   *estate = node->ss.ps.state;
+ 
+ 	if (estate->es_epqTuple != NULL)
+ 	{
+ 		/*
+ 		 * We are inside an EvalPlanQual recheck.  If foreign join, get next
+ 		 * tuple from subplan.
+ 		 */
+ 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+ 
+ 		if (scanrelid == 0)
+ 		{
+ 			PlanState  *outerPlan = outerPlanState(node);
+ 
+ 			return ExecProcNode(outerPlan);
+ 		}
+ 	}
+ 
  	return ExecScan((ScanState *) node,
  					(ExecScanAccessMtd) ForeignNext,
  					(ExecScanRecheckMtd) ForeignRecheck);
***************
*** 117,122 **** ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
--- 146,166 ----
  	scanstate->ss.ps.plan = (Plan *) node;
  	scanstate->ss.ps.state = estate;
  
+ 	if (estate->es_epqTuple != NULL)
+ 	{
+ 		/*
+ 		 * We are inside an EvalPlanQual recheck.  If foreign join, initialize
+ 		 * subplan.
+ 		 */
+ 		if (scanrelid == 0)
+ 		{
+ 			Plan	   *subplan = node->fs_subplan;
+ 
+ 			outerPlanState(scanstate) = ExecInitNode(subplan, estate, eflags);
+ 			return scanstate;
+ 		}
+ 	}
+ 
  	/*
  	 * Miscellaneous initialization
  	 *
***************
*** 135,140 **** ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
--- 179,187 ----
  	scanstate->ss.ps.qual = (List *)
  		ExecInitExpr((Expr *) node->scan.plan.qual,
  					 (PlanState *) scanstate);
+ 	scanstate->fss_remotequals = (List *)
+ 		ExecInitExpr((Expr *) node->fs_remotequals,
+ 					 (PlanState *) scanstate);
  
  	/*
  	 * tuple table initialization
***************
*** 207,212 **** ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
--- 254,276 ----
  void
  ExecEndForeignScan(ForeignScanState *node)
  {
+ 	EState	   *estate = node->ss.ps.state;
+ 
+ 	if (estate->es_epqTuple != NULL)
+ 	{
+ 		/*
+ 		 * We are inside an EvalPlanQual recheck.  If foreign join, close down
+ 		 * subplan.
+ 		 */
+ 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+ 
+ 		if (scanrelid == 0)
+ 		{
+ 			ExecEndNode(outerPlanState(node));
+ 			return;
+ 		}
+ 	}
+ 
  	/* Let the FDW shut down */
  	node->fdwroutine->EndForeignScan(node);
  
***************
*** 231,236 **** ExecEndForeignScan(ForeignScanState *node)
--- 295,324 ----
  void
  ExecReScanForeignScan(ForeignScanState *node)
  {
+ 	EState	   *estate = node->ss.ps.state;
+ 
+ 	if (estate->es_epqTuple != NULL)
+ 	{
+ 		/*
+ 		 * We are inside an EvalPlanQual recheck.  If foreign join, re-scan
+ 		 * subplan.
+ 		 */
+ 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+ 
+ 		if (scanrelid == 0)
+ 		{
+ 			PlanState  *outerPlan = outerPlanState(node);
+ 
+ 			/*
+ 			 * If outerPlan->chgParam is not null then plan will be
+ 			 * automatically re-scanned by first ExecProcNode.
+ 			 */
+ 			if (outerPlan->chgParam == NULL)
+ 				ExecReScan(outerPlan);
+ 			return;
+ 		}
+ 	}
+ 
  	node->fdwroutine->ReScanForeignScan(node);
  
  	ExecScanReScan(&node->ss);
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 625,630 **** _copyForeignScan(const ForeignScan *from)
--- 625,632 ----
  	COPY_NODE_FIELD(fdw_private);
  	COPY_NODE_FIELD(fdw_scan_tlist);
  	COPY_BITMAPSET_FIELD(fs_relids);
+ 	COPY_NODE_FIELD(fs_subplan);
+ 	COPY_NODE_FIELD(fs_remotequals);
  	COPY_SCALAR_FIELD(fsSystemCol);
  
  	return newnode;
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 580,585 **** _outForeignScan(StringInfo str, const ForeignScan *node)
--- 580,587 ----
  	WRITE_NODE_FIELD(fdw_private);
  	WRITE_NODE_FIELD(fdw_scan_tlist);
  	WRITE_BITMAPSET_FIELD(fs_relids);
+ 	WRITE_NODE_FIELD(fs_subplan);
+ 	WRITE_NODE_FIELD(fs_remotequals);
  	WRITE_BOOL_FIELD(fsSystemCol);
  }
  
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 2117,2125 **** create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
--- 2117,2134 ----
  			replace_nestloop_params(root, (Node *) scan_plan->scan.plan.qual);
  		scan_plan->fdw_exprs = (List *)
  			replace_nestloop_params(root, (Node *) scan_plan->fdw_exprs);
+ 		scan_plan->fs_remotequals = (List *)
+ 			replace_nestloop_params(root, (Node *) scan_plan->fs_remotequals);
  	}
  
  	/*
+ 	 * If we're scanning a join relation, generate the local join plan for
+ 	 * EvalPlanQual support.  (Irrelevant if scanning a base relation.)
+ 	 */
+ 	if (scan_relid == 0)
+ 		scan_plan->fs_subplan = create_plan_recurse(root, best_path->subpath);
+ 
+ 	/*
  	 * Detect whether any system columns are requested from rel.  This is a
  	 * bit of a kluge and might go away someday, so we intentionally leave it
  	 * out of the API presented to FDWs.
***************
*** 3702,3708 **** make_foreignscan(List *qptlist,
  				 Index scanrelid,
  				 List *fdw_exprs,
  				 List *fdw_private,
! 				 List *fdw_scan_tlist)
  {
  	ForeignScan *node = makeNode(ForeignScan);
  	Plan	   *plan = &node->scan.plan;
--- 3711,3718 ----
  				 Index scanrelid,
  				 List *fdw_exprs,
  				 List *fdw_private,
! 				 List *fdw_scan_tlist,
! 				 List *fs_remotequals)
  {
  	ForeignScan *node = makeNode(ForeignScan);
  	Plan	   *plan = &node->scan.plan;
***************
*** 3720,3725 **** make_foreignscan(List *qptlist,
--- 3730,3738 ----
  	node->fdw_scan_tlist = fdw_scan_tlist;
  	/* fs_relids will be filled in by create_foreignscan_plan */
  	node->fs_relids = NULL;
+ 	/* fs_subplan will be filled in by create_foreignscan_plan */
+ 	node->fs_subplan = NULL;
+ 	node->fs_remotequals = fs_remotequals;
  	/* fsSystemCol will be filled in by create_foreignscan_plan */
  	node->fsSystemCol = false;
  
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
***************
*** 1124,1129 **** set_foreignscan_references(PlannerInfo *root,
--- 1124,1131 ----
  		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
  		fscan->fdw_scan_tlist =
  			fix_scan_list(root, fscan->fdw_scan_tlist, rtoffset);
+ 		/* fs_subplan needs set_plan_refs() adjustments */
+ 		set_plan_refs(root, fscan->fs_subplan, rtoffset);
  	}
  	else
  	{
***************
*** 1134,1139 **** set_foreignscan_references(PlannerInfo *root,
--- 1136,1144 ----
  			fix_scan_list(root, fscan->scan.plan.qual, rtoffset);
  		fscan->fdw_exprs =
  			fix_scan_list(root, fscan->fdw_exprs, rtoffset);
+ 		/* fs_remotequals needs the same adjustments, too */
+ 		fscan->fs_remotequals =
+ 			fix_scan_list(root, fscan->fs_remotequals, rtoffset);
  	}
  
  	/* Adjust fs_relids if needed */
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
***************
*** 2375,2380 **** finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
--- 2375,2391 ----
  							  &context);
  			/* We assume fdw_scan_tlist cannot contain Params */
  			context.paramids = bms_add_members(context.paramids, scan_params);
+ 
+ 			/*
+ 			 * We need not look at fs_remotequals, since it will have the same
+ 			 * param references as fdw_exprs.  Also we need not include params
+ 			 * in fs_subplan.  However, fs_subplan itself needs finalize_plan()
+ 			 * processing.
+ 			 */
+ 			finalize_plan(root,
+ 						  ((ForeignScan *) plan)->fs_subplan,
+ 						  valid_params,
+ 						  scan_params);
  			break;
  
  		case T_CustomScan:
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***************
*** 1462,1467 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1462,1468 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *subpath,
  						List *fdw_private)
  {
  	ForeignPath *pathnode = makeNode(ForeignPath);
***************
*** 1475,1480 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1476,1482 ----
  	pathnode->path.total_cost = total_cost;
  	pathnode->path.pathkeys = pathkeys;
  
+ 	pathnode->subpath = subpath;
  	pathnode->fdw_private = fdw_private;
  
  	return pathnode;
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1580,1585 **** typedef struct WorkTableScanState
--- 1580,1586 ----
  typedef struct ForeignScanState
  {
  	ScanState	ss;				/* its first field is NodeTag */
+ 	List	   *fss_remotequals;
  	/* use struct pointer to avoid including fdwapi.h here */
  	struct FdwRoutine *fdwroutine;
  	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 522,527 **** typedef struct ForeignScan
--- 522,529 ----
  	List	   *fdw_private;	/* private data for FDW */
  	List	   *fdw_scan_tlist; /* optional tlist describing scan tuple */
  	Bitmapset  *fs_relids;		/* RTIs generated by this scan */
+ 	Plan	   *fs_subplan;		/* alternative Plan node if foreign join */
+ 	List	   *fs_remotequals;	/* list of remote quals if foreign table */
  	bool		fsSystemCol;	/* true if any "system column" is needed */
  } ForeignScan;
  
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
***************
*** 890,899 **** typedef struct TidPath
--- 890,905 ----
   * generally a good idea to use a representation that can be dumped by
   * nodeToString(), so that you can examine the structure during debugging
   * with tools like pprint().
+  *
+  * If a ForeignPath node represents a remote join of foreign tables, subpath
+  * is a local join of those tables with equivalent results that will be used
+  * for EvalPlanQual testing.  Note that subpath must have the same pathkeys
+  * and parameterization as that of the path's output.
   */
  typedef struct ForeignPath
  {
  	Path		path;
+ 	Path	   *subpath;
  	List	   *fdw_private;
  } ForeignPath;
  
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
***************
*** 83,88 **** extern ForeignPath *create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 83,89 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *subpath,
  						List *fdw_private);
  
  extern Relids calc_nestloop_required_outer(Path *outer_path, Path *inner_path);
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
***************
*** 45,51 **** extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
  				  Index scanrelid, Plan *subplan);
  extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
  				 Index scanrelid, List *fdw_exprs, List *fdw_private,
! 				 List *fdw_scan_tlist);
  extern Append *make_append(List *appendplans, List *tlist);
  extern RecursiveUnion *make_recursive_union(List *tlist,
  					 Plan *lefttree, Plan *righttree, int wtParam,
--- 45,51 ----
  				  Index scanrelid, Plan *subplan);
  extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
  				 Index scanrelid, List *fdw_exprs, List *fdw_private,
! 				 List *fdw_scan_tlist, List *fs_remotequals);
  extern Append *make_append(List *appendplans, List *tlist);
  extern RecursiveUnion *make_recursive_union(List *tlist,
  					 Plan *lefttree, Plan *righttree, int wtParam,

Attachment: foreign_join_v16_efujita.patch (text/x-patch)
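
A note for readers following the Remote SQL in this patch's regression output: the column aliases (a9, a10, a11, ...) come from deparseTargetList(), which appends attnum - FirstLowInvalidHeapAttributeNumber to the letter "a". A minimal standalone sketch of that arithmetic follows. It assumes FirstLowInvalidHeapAttributeNumber is -8 and that ctid has attribute number -1; both values are inferred from the expected output (attnum 1 deparses as a9) and are not stated in the thread. alias_number() and format_column_alias() are hypothetical helpers for illustration, not functions from the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed values; the patch itself uses the constants from PostgreSQL's headers. */
#define FIRST_LOW_INVALID_ATTNUM (-8)	/* stands in for FirstLowInvalidHeapAttributeNumber */
#define CTID_ATTNUM (-1)				/* stands in for SelfItemPointerAttributeNumber */

/*
 * Map an attribute number to the number used in its column alias.
 * System columns have negative attnums, so subtracting the (negative)
 * lower bound makes every alias number positive.
 */
static int
alias_number(int attnum)
{
	assert(attnum > FIRST_LOW_INVALID_ATTNUM);
	return attnum - FIRST_LOW_INVALID_ATTNUM;
}

/* Write the alias text (for example "a9") into buf; returns snprintf's result. */
static int
format_column_alias(char *buf, size_t buflen, int attnum)
{
	return snprintf(buf, buflen, "a%d", alias_number(attnum));
}
```

Under these assumptions, attnum 1 maps to a9 and attnum 3 to a11, which agrees with `c1 a9, c3 a11` in the expected output for ft4, and ctid would map to a7.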

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 81cb2b4..08bd352 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -44,8 +44,11 @@
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
+#include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/prep.h"
 #include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "utils/builtins.h"
@@ -89,6 +92,8 @@ typedef struct deparse_expr_cxt
 	RelOptInfo *foreignrel;		/* the foreign relation we are planning for */
 	StringInfo	buf;			/* output buffer to append to */
 	List	  **params_list;	/* exprs that will become remote Params */
+	List	   *outertlist;		/* outer child's target list */
+	List	   *innertlist;		/* inner child's target list */
 } deparse_expr_cxt;
 
 /*
@@ -108,7 +113,8 @@ static void deparseTargetList(StringInfo buf,
 				  Index rtindex,
 				  Relation rel,
 				  Bitmapset *attrs_used,
-				  List **retrieved_attrs);
+				  List **retrieved_attrs,
+				  bool alias);
 static void deparseReturningList(StringInfo buf, PlannerInfo *root,
 					 Index rtindex, Relation rel,
 					 bool trig_after_row,
@@ -136,6 +142,7 @@ static void printRemoteParam(int paramindex, Oid paramtype, int32 paramtypmod,
 				 deparse_expr_cxt *context);
 static void printRemotePlaceholder(Oid paramtype, int32 paramtypmod,
 					   deparse_expr_cxt *context);
+static const char *get_jointype_name(JoinType jointype);
 
 
 /*
@@ -250,7 +257,7 @@ foreign_expr_walker(Node *node,
 				 * Param's collation, ie it's not safe for it to have a
 				 * non-default collation.
 				 */
-				if (var->varno == glob_cxt->foreignrel->relid &&
+				if (bms_is_member(var->varno, glob_cxt->foreignrel->relids) &&
 					var->varlevelsup == 0)
 				{
 					/* Var belongs to foreign table */
@@ -675,18 +682,88 @@ is_builtin(Oid oid)
  *
  * We also create an integer List of the columns being retrieved, which is
  * returned to *retrieved_attrs.
+ *
+ * relations is a string buffer for the "Relations" portion of EXPLAIN output,
+ * or NULL if the caller doesn't need it.  If given, it must already have been
+ * initialized by the caller.
+ *
+ * alias is a flag to add aliases for columns and tables.  It should be
+ * false in the initial call, and is set to true when this function is called
+ * recursively to build part of a join query.
  */
 void
 deparseSelectSql(StringInfo buf,
 				 PlannerInfo *root,
 				 RelOptInfo *baserel,
 				 Bitmapset *attrs_used,
-				 List **retrieved_attrs)
+				 List *remote_conds,
+				 List **params_list,
+				 List **fdw_scan_tlist,
+				 List **retrieved_attrs,
+				 StringInfo relations,
+				 bool alias)
 {
+	PgFdwRelationInfo  *fpinfo = (PgFdwRelationInfo *) baserel->fdw_private;
 	RangeTblEntry *rte = planner_rt_fetch(baserel->relid, root);
 	Relation	rel;
 
 	/*
+	 * If the given relation is a join relation, recursively construct the
+	 * statement by putting the outer and inner relations in the FROM clause
+	 * as aliased subqueries.
+	 */
+	if (baserel->reloptkind == RELOPT_JOINREL)
+	{
+		RelOptInfo		   *rel_o = fpinfo->outerrel;
+		RelOptInfo		   *rel_i = fpinfo->innerrel;
+		PgFdwRelationInfo  *fpinfo_o = (PgFdwRelationInfo *) rel_o->fdw_private;
+		PgFdwRelationInfo  *fpinfo_i = (PgFdwRelationInfo *) rel_i->fdw_private;
+		StringInfoData		sql_o;
+		StringInfoData		sql_i;
+		List			   *ret_attrs_tmp;	/* not used */
+		StringInfoData		relations_o;
+		StringInfoData		relations_i;
+		const char		   *jointype_str;
+
+		/*
+		 * Deparse query for outer and inner relation, and combine them into
+		 * a query.
+		 *
+		 * We don't pass fdw_scan_tlist here because the targets of the
+		 * underlying relations are already in joinrel->reltargetlist, and
+		 * deparseJoinSql() takes care of them.
+		 */
+		initStringInfo(&sql_o);
+		initStringInfo(&relations_o);
+		deparseSelectSql(&sql_o, root, rel_o, fpinfo_o->attrs_used,
+						 fpinfo_o->remote_conds, params_list,
+						 NULL, &ret_attrs_tmp, &relations_o, true);
+		initStringInfo(&sql_i);
+		initStringInfo(&relations_i);
+		deparseSelectSql(&sql_i, root, rel_i, fpinfo_i->attrs_used,
+						 fpinfo_i->remote_conds, params_list,
+						 NULL, &ret_attrs_tmp, &relations_i, true);
+
+		/* For EXPLAIN output */
+		jointype_str = get_jointype_name(fpinfo->jointype);
+		if (relations)
+			appendStringInfo(relations, "(%s) %s JOIN (%s)",
+							 relations_o.data, jointype_str, relations_i.data);
+
+		deparseJoinSql(buf, root, baserel,
+					   fpinfo->outerrel,
+					   fpinfo->innerrel,
+					   sql_o.data,
+					   sql_i.data,
+					   fpinfo->jointype,
+					   fpinfo->joinclauses,
+					   fpinfo->otherclauses,
+					   fdw_scan_tlist,
+					   retrieved_attrs);
+		return;
+	}
+
+	/*
 	 * Core code already has some lock on each rel being planned, so we can
 	 * use NoLock here.
 	 */
@@ -697,7 +774,7 @@ deparseSelectSql(StringInfo buf,
 	 */
 	appendStringInfoString(buf, "SELECT ");
 	deparseTargetList(buf, root, baserel->relid, rel, attrs_used,
-					  retrieved_attrs);
+					  retrieved_attrs, alias);
 
 	/*
 	 * Construct FROM clause
@@ -705,6 +782,87 @@ deparseSelectSql(StringInfo buf,
 	appendStringInfoString(buf, " FROM ");
 	deparseRelation(buf, rel);
 
+	/*
+	 * Return local relation name for EXPLAIN output.
+	 * We can't know whether the VERBOSE option is specified, so always add
+	 * the schema name.
+	 */
+	if (relations)
+	{
+		const char	   *namespace;
+		const char	   *relname;
+		const char	   *refname;
+
+		namespace = get_namespace_name(get_rel_namespace(rte->relid));
+		relname = get_rel_name(rte->relid);
+		refname = rte->eref->aliasname;
+		appendStringInfo(relations, "%s.%s",
+						 quote_identifier(namespace),
+						 quote_identifier(relname));
+		if (*refname && strcmp(refname, relname) != 0)
+			appendStringInfo(relations, " %s",
+							 quote_identifier(rte->eref->aliasname));
+	}
+
+	/*
+	 * Construct WHERE clause
+	 */
+	if (remote_conds)
+		appendConditions(buf, root, baserel, NULL, NULL, remote_conds,
+						 " WHERE ", params_list);
+
+	/*
+	 * Add FOR UPDATE/SHARE if appropriate.  We apply locking during the
+	 * initial row fetch, rather than later on as is done for local tables.
+	 * The extra roundtrips involved in trying to duplicate the local
+	 * semantics exactly don't seem worthwhile (see also comments for
+	 * RowMarkType).
+	 *
+	 * Note: because we actually run the query as a cursor, this assumes
+	 * that DECLARE CURSOR ... FOR UPDATE is supported, which it isn't
+	 * before 8.3.
+	 */
+	if (baserel->relid == root->parse->resultRelation &&
+		(root->parse->commandType == CMD_UPDATE ||
+		 root->parse->commandType == CMD_DELETE))
+	{
+		/* Relation is UPDATE/DELETE target, so use FOR UPDATE */
+		appendStringInfoString(buf, " FOR UPDATE");
+	}
+	else
+	{
+		PlanRowMark *rc = get_plan_rowmark(root->rowMarks, baserel->relid);
+
+		if (rc)
+		{
+			/*
+			 * Relation is specified as a FOR UPDATE/SHARE target, so handle
+			 * that.  (But we could also see LCS_NONE, meaning this isn't a
+			 * target relation after all.)
+			 *
+			 * For now, just ignore any [NO] KEY specification, since (a)
+			 * it's not clear what that means for a remote table that we
+			 * don't have complete information about, and (b) it wouldn't
+			 * work anyway on older remote servers.  Likewise, we don't
+			 * worry about NOWAIT.
+			 */
+			switch (rc->strength)
+			{
+				case LCS_NONE:
+					/* No locking needed */
+					break;
+				case LCS_FORKEYSHARE:
+				case LCS_FORSHARE:
+					appendStringInfoString(buf, " FOR SHARE");
+					break;
+				case LCS_FORNOKEYUPDATE:
+				case LCS_FORUPDATE:
+					appendStringInfoString(buf, " FOR UPDATE");
+					break;
+			}
+		}
+	}
+
 	heap_close(rel, NoLock);
 }
 
@@ -721,7 +879,8 @@ deparseTargetList(StringInfo buf,
 				  Index rtindex,
 				  Relation rel,
 				  Bitmapset *attrs_used,
-				  List **retrieved_attrs)
+				  List **retrieved_attrs,
+				  bool alias)
 {
 	TupleDesc	tupdesc = RelationGetDescr(rel);
 	bool		have_wholerow;
@@ -752,6 +911,9 @@ deparseTargetList(StringInfo buf,
 			first = false;
 
 			deparseColumnRef(buf, rtindex, i, root);
+			if (alias)
+				appendStringInfo(buf, " a%d",
+								 i - FirstLowInvalidHeapAttributeNumber);
 
 			*retrieved_attrs = lappend_int(*retrieved_attrs, i);
 		}
@@ -769,6 +931,9 @@ deparseTargetList(StringInfo buf,
 		first = false;
 
 		appendStringInfoString(buf, "ctid");
+		if (alias)
+			appendStringInfo(buf, " a%d",
+							 SelfItemPointerAttributeNumber - FirstLowInvalidHeapAttributeNumber);
 
 		*retrieved_attrs = lappend_int(*retrieved_attrs,
 									   SelfItemPointerAttributeNumber);
@@ -780,11 +945,13 @@ deparseTargetList(StringInfo buf,
 }
 
 /*
- * Deparse WHERE clauses in given list of RestrictInfos and append them to buf.
+ * Deparse the conditions in the given list (e.g. for a WHERE clause or the ON
+ * clause of a JOIN), whose members are RestrictInfo or Expr nodes, and append
+ * their string representation to buf.
  *
  * baserel is the foreign table we're planning for.
  *
- * If no WHERE clause already exists in the buffer, is_first should be true.
+ * prefix is placed before the conditions, if any.
  *
  * If params is not NULL, it receives a list of Params and other-relation Vars
  * used in the clauses; these values must be transmitted to the remote server
@@ -794,16 +961,19 @@ deparseTargetList(StringInfo buf,
  * so Params and other-relation Vars should be replaced by dummy values.
  */
 void
-appendWhereClause(StringInfo buf,
-				  PlannerInfo *root,
-				  RelOptInfo *baserel,
-				  List *exprs,
-				  bool is_first,
-				  List **params)
+appendConditions(StringInfo buf,
+				 PlannerInfo *root,
+				 RelOptInfo *baserel,
+				 List *outertlist,
+				 List *innertlist,
+				 List *exprs,
+				 const char *prefix,
+				 List **params)
 {
 	deparse_expr_cxt context;
 	int			nestlevel;
 	ListCell   *lc;
+	bool		is_first = prefix == NULL ? false : true;
 
 	if (params)
 		*params = NIL;			/* initialize result list to empty */
@@ -813,22 +983,36 @@ appendWhereClause(StringInfo buf,
 	context.foreignrel = baserel;
 	context.buf = buf;
 	context.params_list = params;
+	context.outertlist = outertlist;
+	context.innertlist = innertlist;
 
 	/* Make sure any constants in the exprs are printed portably */
 	nestlevel = set_transmission_modes();
 
 	foreach(lc, exprs)
 	{
+		Node	   *node = (Node *) lfirst(lc);
 		RestrictInfo *ri = (RestrictInfo *) lfirst(lc);
+		Expr	   *expr = (Expr *) lfirst(lc);
+
+		if (IsA(node, RestrictInfo))
+		{
+			expr = ri->clause;
+		}
+		else
+		{
+			/* not a RestrictInfo, so the node itself is the expression */
+			expr = (Expr *) node;
+		}
 
 		/* Connect expressions with "AND" and parenthesize each condition. */
 		if (is_first)
-			appendStringInfoString(buf, " WHERE ");
+			appendStringInfoString(buf, prefix);
 		else
 			appendStringInfoString(buf, " AND ");
 
 		appendStringInfoChar(buf, '(');
-		deparseExpr(ri->clause, &context);
+		deparseExpr(expr, &context);
 		appendStringInfoChar(buf, ')');
 
 		is_first = false;
@@ -838,6 +1022,297 @@ appendWhereClause(StringInfo buf,
 }
 
 /*
+ * Returns the 1-based position of the given Var in the given target list, or
+ * 0 if it is not found.
+ */
+static int
+find_var_pos(Var *node, List *tlist)
+{
+	int		pos = 1;
+	ListCell *lc;
+
+	foreach(lc, tlist)
+	{
+		Var *var = (Var *) lfirst(lc);
+
+		if (equal(var, node))
+		{
+			return pos;
+		}
+		pos++;
+	}
+
+	return 0;
+}
+
+/*
+ * Deparse given Var into buf.
+ */
+static void
+deparseJoinVar(Var *node, deparse_expr_cxt *context)
+{
+	char		side;
+	int			pos;
+
+	pos = find_var_pos(node, context->outertlist);
+	if (pos > 0)
+		side = 'l';
+	else
+	{
+		side = 'r';
+		pos = find_var_pos(node, context->innertlist);
+	}
+
+	/*
+	 * We treat whole-row references the same as ordinary attribute references,
+	 * because that transformation should be done at a lower level.
+	 */
+	appendStringInfo(context->buf, "%c.a%d", side, pos);
+}
+
+/*
+ * Deparse column alias list for a subquery in FROM clause.
+ */
+static void
+deparseColumnAliases(StringInfo buf, List *tlist)
+{
+	int			pos;
+	ListCell   *lc;
+
+	pos = 1;
+	foreach(lc, tlist)
+	{
+		/* Deparse column alias for the subquery */
+		if (pos > 1)
+			appendStringInfoString(buf, ", ");
+		appendStringInfo(buf, "a%d", pos);
+		pos++;
+	}
+}
+
+/*
+ * Deparse "wrapper" SQL for a query which projects target lists in proper
+ * order and contents.  Note that this treatment is necessary only for queries
+ * used in FROM clause of a join query.
+ *
+ * Even if the SQL is simple enough (no ctid, no whole-row reference), the order
+ * of output columns might differ from the underlying scan, so we always need
+ * to wrap the queries for join sources.
+ *
+ */
+static const char *
+deparseProjectionSql(PlannerInfo *root,
+					 RelOptInfo *baserel,
+					 const char *sql,
+					 char side)
+{
+	StringInfoData wholerow;
+	StringInfoData buf;
+	ListCell   *lc;
+	bool		first;
+	bool		have_wholerow = false;
+
+	/*
+	 * Check whether the targetlist contains a whole-row reference (varattno
+	 * 0); if so, a ROW() expression is built for it below.
+	 */
+	foreach(lc, baserel->reltargetlist)
+	{
+		Var		   *var = (Var *) lfirst(lc);
+		if (var->varattno == 0)
+		{
+			have_wholerow = true;
+			break;
+		}
+	}
+
+	/*
+	 * Construct whole-row reference with ROW() syntax
+	 */
+	if (have_wholerow)
+	{
+		RangeTblEntry *rte;
+		Relation		rel;
+		TupleDesc		tupdesc;
+		int				i;
+
+		/* Obtain TupleDesc for deparsing all valid columns */
+		rte = planner_rt_fetch(baserel->relid, root);
+		rel = heap_open(rte->relid, NoLock);
+		tupdesc = rel->rd_att;
+
+		/* Print all valid columns in ROW() to generate whole-row value */
+		initStringInfo(&wholerow);
+		appendStringInfoString(&wholerow, "ROW(");
+		first = true;
+		for (i = 1; i <= tupdesc->natts; i++)
+		{
+			Form_pg_attribute attr = tupdesc->attrs[i - 1];
+
+			/* Ignore dropped columns. */
+			if (attr->attisdropped)
+				continue;
+
+			if (!first)
+				appendStringInfoString(&wholerow, ", ");
+			first = false;
+
+			appendStringInfo(&wholerow, "%c.a%d", side,
+							 i - FirstLowInvalidHeapAttributeNumber);
+		}
+		appendStringInfoString(&wholerow, ")");
+
+		heap_close(rel, NoLock);
+	}
+
+	/*
+	 * Construct a SELECT statement which has the original query in its FROM
+	 * clause and the target list entries in its SELECT clause.  The numbers
+	 * used in column aliases are attnum - FirstLowInvalidHeapAttributeNumber,
+	 * so that all of them are positive even for system columns, which have
+	 * negative attnums.
+	 */
+	initStringInfo(&buf);
+	appendStringInfoString(&buf, "SELECT ");
+	first = true;
+	foreach(lc, baserel->reltargetlist)
+	{
+		Var *var = (Var *) lfirst(lc);
+
+		if (!first)
+			appendStringInfoString(&buf, ", ");
+	
+		if (var->varattno == 0)
+			appendStringInfo(&buf, "%s", wholerow.data);
+		else
+			appendStringInfo(&buf, "%c.a%d", side,
+							 var->varattno - FirstLowInvalidHeapAttributeNumber);
+
+		first = false;
+	}
+	appendStringInfo(&buf, " FROM (%s) %c", sql, side);
+
+	return buf.data;
+}
+
+static const char *
+get_jointype_name(JoinType jointype)
+{
+	if (jointype == JOIN_INNER)
+		return "INNER";
+	else if (jointype == JOIN_LEFT)
+		return "LEFT";
+	else if (jointype == JOIN_RIGHT)
+		return "RIGHT";
+	else if (jointype == JOIN_FULL)
+		return "FULL";
+
+	/* not reached */
+	elog(ERROR, "unsupported join type %d", jointype);
+}
+
+/*
+ * Construct a SELECT statement which contains join clause.
+ *
+ * We also create a TargetEntry List of the columns being retrieved, which is
+ * returned to *fdw_scan_tlist.
+ *
+ * outerrel and sql_o are respectively the RelOptInfo and remote query
+ * statement of the outer child relation; innerrel and sql_i are those of the
+ * inner child relation.  jointype and joinclauses describe the join itself.
+ * fdw_scan_tlist is an output parameter that passes the target list of the
+ * pseudo scan back to the caller.
+ */
+void
+deparseJoinSql(StringInfo buf,
+			   PlannerInfo *root,
+			   RelOptInfo *baserel,
+			   RelOptInfo *outerrel,
+			   RelOptInfo *innerrel,
+			   const char *sql_o,
+			   const char *sql_i,
+			   JoinType jointype,
+			   List *joinclauses,
+			   List *otherclauses,
+			   List **fdw_scan_tlist,
+			   List **retrieved_attrs)
+{
+	StringInfoData selbuf;		/* buffer for SELECT clause */
+	StringInfoData abuf_o;		/* buffer for column alias list of outer */
+	StringInfoData abuf_i;		/* buffer for column alias list of inner */
+	int			i;
+	ListCell   *lc;
+	const char *jointype_str;
+	deparse_expr_cxt context;
+
+	context.root = root;
+	context.foreignrel = baserel;
+	context.buf = &selbuf;
+	context.params_list = NULL;
+	context.outertlist = outerrel->reltargetlist;
+	context.innertlist = innerrel->reltargetlist;
+
+	jointype_str = get_jointype_name(jointype);
+	*retrieved_attrs = NIL;
+
+	/* print SELECT clause of the join scan */
+	initStringInfo(&selbuf);
+	i = 0;
+	foreach(lc, baserel->reltargetlist)
+	{
+		Var		   *var = (Var *) lfirst(lc);
+		TargetEntry *tle;
+
+		if (i > 0)
+			appendStringInfoString(&selbuf, ", ");
+		deparseJoinVar(var, &context);
+
+		tle = makeTargetEntry((Expr *) var, i + 1, NULL, false);
+		if (fdw_scan_tlist)
+			*fdw_scan_tlist = lappend(*fdw_scan_tlist, tle);
+
+		*retrieved_attrs = lappend_int(*retrieved_attrs, i + 1);
+
+		i++;
+	}
+	if (i == 0)
+		appendStringInfoString(&selbuf, "NULL");
+
+	/*
+	 * Do pseudo-projection for an underlying scan on a foreign table if the
+	 * relation is a base relation.  The wrapper query fixes the column order
+	 * and expands any whole-row reference into ROW().
+	 */
+	if (outerrel->reloptkind == RELOPT_BASEREL)
+		sql_o = deparseProjectionSql(root, outerrel, sql_o, 'l');
+	if (innerrel->reloptkind == RELOPT_BASEREL)
+		sql_i = deparseProjectionSql(root, innerrel, sql_i, 'r');
+
+	/* Deparse column alias portion of subquery in FROM clause. */
+	initStringInfo(&abuf_o);
+	deparseColumnAliases(&abuf_o, outerrel->reltargetlist);
+	initStringInfo(&abuf_i);
+	deparseColumnAliases(&abuf_i, innerrel->reltargetlist);
+
+	/* Construct SELECT statement */
+	appendStringInfo(buf, "SELECT %s FROM", selbuf.data);
+	appendStringInfo(buf, " (%s) l (%s) %s JOIN (%s) r (%s)",
+					 sql_o, abuf_o.data, jointype_str, sql_i, abuf_i.data);
+	/* Append ON clause */
+	if (joinclauses)
+		appendConditions(buf, root, baserel,
+						 outerrel->reltargetlist, innerrel->reltargetlist,
+						 joinclauses,
+						 " ON ", NULL);
+	/* Append WHERE clause */
+	if (otherclauses)
+		appendConditions(buf, root, baserel,
+						 outerrel->reltargetlist, innerrel->reltargetlist,
+						 otherclauses,
+						 " WHERE ", NULL);
+}
+
+/*
  * deparse remote INSERT statement
  *
  * The statement text is appended to buf, and we also create an integer List
@@ -997,7 +1472,7 @@ deparseReturningList(StringInfo buf, PlannerInfo *root,
 	{
 		appendStringInfoString(buf, " RETURNING ");
 		deparseTargetList(buf, root, rtindex, rel, attrs_used,
-						  retrieved_attrs);
+						  retrieved_attrs, false);
 	}
 	else
 		*retrieved_attrs = NIL;
@@ -1264,6 +1739,8 @@ deparseExpr(Expr *node, deparse_expr_cxt *context)
 /*
  * Deparse given Var node into context->buf.
  *
+ * If the foreign relation is a join relation, this is invoked for join clauses.
+ *
  * If the Var belongs to the foreign relation, just print its remote name.
  * Otherwise, it's effectively a Param (and will in fact be a Param at
  * run time).  Handle it the same way we handle plain Params --- see
@@ -1274,39 +1751,46 @@ deparseVar(Var *node, deparse_expr_cxt *context)
 {
 	StringInfo	buf = context->buf;
 
-	if (node->varno == context->foreignrel->relid &&
-		node->varlevelsup == 0)
+	if (context->foreignrel->reloptkind == RELOPT_JOINREL)
 	{
-		/* Var belongs to foreign table */
-		deparseColumnRef(buf, node->varno, node->varattno, context->root);
+		deparseJoinVar(node, context);
 	}
 	else
 	{
-		/* Treat like a Param */
-		if (context->params_list)
+		if (node->varno == context->foreignrel->relid &&
+			node->varlevelsup == 0)
 		{
-			int			pindex = 0;
-			ListCell   *lc;
-
-			/* find its index in params_list */
-			foreach(lc, *context->params_list)
+			/* Var belongs to foreign table */
+			deparseColumnRef(buf, node->varno, node->varattno, context->root);
+		}
+		else
+		{
+			/* Treat like a Param */
+			if (context->params_list)
 			{
-				pindex++;
-				if (equal(node, (Node *) lfirst(lc)))
-					break;
+				int			pindex = 0;
+				ListCell   *lc;
+
+				/* find its index in params_list */
+				foreach(lc, *context->params_list)
+				{
+					pindex++;
+					if (equal(node, (Node *) lfirst(lc)))
+						break;
+				}
+				if (lc == NULL)
+				{
+					/* not in list, so add it */
+					pindex++;
+					*context->params_list = lappend(*context->params_list, node);
+				}
+
+				printRemoteParam(pindex, node->vartype, node->vartypmod, context);
 			}
-			if (lc == NULL)
+			else
 			{
-				/* not in list, so add it */
-				pindex++;
-				*context->params_list = lappend(*context->params_list, node);
+				printRemotePlaceholder(node->vartype, node->vartypmod, context);
 			}
-
-			printRemoteParam(pindex, node->vartype, node->vartypmod, context);
-		}
-		else
-		{
-			printRemotePlaceholder(node->vartype, node->vartypmod, context);
 		}
 	}
 }
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 1f417b3..80e22ae 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9,11 +9,16 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -35,6 +40,18 @@ CREATE TABLE "S 1"."T 2" (
 	c2 text,
 	CONSTRAINT t2_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 3" (
+	c1 int NOT NULL,
+	c2 int NOT NULL,
+	c3 text,
+	CONSTRAINT t3_pkey PRIMARY KEY (c1)
+);
+CREATE TABLE "S 1"."T 4" (
+	c1 int NOT NULL,
+	c2 int NOT NULL,
+	c4 text,
+	CONSTRAINT t4_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -49,8 +66,22 @@ INSERT INTO "S 1"."T 2"
 	SELECT id,
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
+INSERT INTO "S 1"."T 3"
+	SELECT id,
+	       id + 1,
+	       'AAA' || to_char(id, 'FM000')
+	FROM generate_series(1, 100) id;
+DELETE FROM "S 1"."T 3" WHERE c1 % 2 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 4"
+	SELECT id,
+	       id + 1,
+	       'AAA' || to_char(id, 'FM000')
+	FROM generate_series(1, 100) id;
+DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
+ANALYZE "S 1"."T 3";
+ANALYZE "S 1"."T 4";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -78,6 +109,26 @@ CREATE FOREIGN TABLE ft2 (
 	c8 user_enum
 ) SERVER loopback;
 ALTER FOREIGN TABLE ft2 DROP COLUMN cx;
+CREATE FOREIGN TABLE ft4 (
+	c1 int NOT NULL,
+	c2 int NOT NULL,
+	c3 text
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 3');
+CREATE FOREIGN TABLE ft5 (
+	c1 int NOT NULL,
+	c2 int NOT NULL,
+	c3 text
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft6 (
+	c1 int NOT NULL,
+	c2 int NOT NULL,
+	c3 text
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE USER view_owner;
+GRANT ALL ON ft5 TO view_owner;
+CREATE VIEW v_ft5 AS SELECT * FROM ft5;
+ALTER VIEW v_ft5 OWNER TO view_owner;
+CREATE USER MAPPING FOR view_owner SERVER loopback;
 -- ===================================================================
 -- tests for validator
 -- ===================================================================
@@ -119,12 +170,15 @@ ALTER FOREIGN TABLE ft2 OPTIONS (schema_name 'S 1', table_name 'T 1');
 ALTER FOREIGN TABLE ft1 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
 ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
 \det+
-                             List of foreign tables
- Schema | Table |  Server  |              FDW Options              | Description 
---------+-------+----------+---------------------------------------+-------------
- public | ft1   | loopback | (schema_name 'S 1', table_name 'T 1') | 
- public | ft2   | loopback | (schema_name 'S 1', table_name 'T 1') | 
-(2 rows)
+                              List of foreign tables
+ Schema | Table |  Server   |              FDW Options              | Description 
+--------+-------+-----------+---------------------------------------+-------------
+ public | ft1   | loopback  | (schema_name 'S 1', table_name 'T 1') | 
+ public | ft2   | loopback  | (schema_name 'S 1', table_name 'T 1') | 
+ public | ft4   | loopback  | (schema_name 'S 1', table_name 'T 3') | 
+ public | ft5   | loopback  | (schema_name 'S 1', table_name 'T 4') | 
+ public | ft6   | loopback2 | (schema_name 'S 1', table_name 'T 4') | 
+(5 rows)
 
 -- Now we should be able to run ANALYZE.
 -- To exercise multiple code paths, we use local stats on ft1
@@ -277,22 +331,6 @@ SELECT COUNT(*) FROM ft1 t1;
   1000
 (1 row)
 
--- join two tables
-SELECT t1.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
- c1  
------
- 101
- 102
- 103
- 104
- 105
- 106
- 107
- 108
- 109
- 110
-(10 rows)
-
 -- subquery
 SELECT * FROM ft1 t1 WHERE t1.c3 IN (SELECT c3 FROM ft2 t2 WHERE c1 <= 10) ORDER BY c1;
  c1 | c2 |  c3   |              c4              |            c5            | c6 |     c7     | c8  
@@ -489,17 +527,13 @@ EXPLAIN (VERBOSE, COSTS false) SELECT * FROM ft1 t1 WHERE c8 = 'foo';  -- can't
 -- parameterized remote path
 EXPLAIN (VERBOSE, COSTS false)
   SELECT * FROM ft2 a, ft2 b WHERE a.c1 = 47 AND b.c1 = a.c2;
-                                                 QUERY PLAN                                                  
--------------------------------------------------------------------------------------------------------------
- Nested Loop
+                                                                                                                                                                                                                                                                                     QUERY PLAN                                                                                                                                                                                                                                                                                      
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Foreign Scan
    Output: a.c1, a.c2, a.c3, a.c4, a.c5, a.c6, a.c7, a.c8, b.c1, b.c2, b.c3, b.c4, b.c5, b.c6, b.c7, b.c8
-   ->  Foreign Scan on public.ft2 a
-         Output: a.c1, a.c2, a.c3, a.c4, a.c5, a.c6, a.c7, a.c8
-         Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" = 47))
-   ->  Foreign Scan on public.ft2 b
-         Output: b.c1, b.c2, b.c3, b.c4, b.c5, b.c6, b.c7, b.c8
-         Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (($1::integer = "C 1"))
-(8 rows)
+   Relations: (public.ft2 a) INNER JOIN (public.ft2 b)
+   Remote SQL: SELECT l.a1, l.a2, l.a3, l.a4, l.a5, l.a6, l.a7, l.a8, r.a1, r.a2, r.a3, r.a4, r.a5, r.a6, r.a7, r.a8 FROM (SELECT l.a9, l.a10, l.a12, l.a13, l.a14, l.a15, l.a16, l.a17 FROM (SELECT "C 1" a9, c2 a10, c3 a12, c4 a13, c5 a14, c6 a15, c7 a16, c8 a17 FROM "S 1"."T 1" WHERE (("C 1" = 47))) l) l (a1, a2, a3, a4, a5, a6, a7, a8) INNER JOIN (SELECT r.a9, r.a10, r.a12, r.a13, r.a14, r.a15, r.a16, r.a17 FROM (SELECT "C 1" a9, c2 a10, c3 a12, c4 a13, c5 a14, c6 a15, c7 a16, c8 a17 FROM "S 1"."T 1") r) r (a1, a2, a3, a4, a5, a6, a7, a8) ON ((l.a2 = r.a1))
+(4 rows)
 
 SELECT * FROM ft2 a, ft2 b WHERE a.c1 = 47 AND b.c1 = a.c2;
  c1 | c2 |  c3   |              c4              |            c5            | c6 |     c7     | c8  | c1 | c2 |  c3   |              c4              |            c5            | c6 |     c7     | c8  
@@ -651,6 +685,670 @@ SELECT * FROM ft2 WHERE c1 = ANY (ARRAY(SELECT c1 FROM ft1 WHERE c1 < 5));
 (4 rows)
 
 -- ===================================================================
+-- JOIN queries
+-- ===================================================================
+-- join two tables
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+                                                                                                               QUERY PLAN                                                                                                                
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1, t1.c3
+   ->  Sort
+         Output: t1.c1, t2.c1, t1.c3
+         Sort Key: t1.c3, t1.c1
+         ->  Foreign Scan
+               Output: t1.c1, t2.c1, t1.c3
+               Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+               Remote SQL: SELECT l.a1, l.a2, r.a1 FROM (SELECT l.a10, l.a12 FROM (SELECT "C 1" a10, c3 a12 FROM "S 1"."T 1") l) l (a1, a2) INNER JOIN (SELECT r.a9 FROM (SELECT "C 1" a9 FROM "S 1"."T 1") r) r (a1) ON ((l.a1 = r.a1))
+(9 rows)
+
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+ c1  | c1  
+-----+-----
+ 101 | 101
+ 102 | 102
+ 103 | 103
+ 104 | 104
+ 105 | 105
+ 106 | 106
+ 107 | 107
+ 108 | 108
+ 109 | 109
+ 110 | 110
+(10 rows)
+
+-- join three tables
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c2, t3.c3 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) JOIN ft4 t3 ON (t3.c1 = t1.c1) ORDER BY t1.c3, t1.c1 OFFSET 10 LIMIT 10;
+                                                                                                                                                                                                              QUERY PLAN                                                                                                                                                                                                               
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c2, t3.c3, t1.c3
+   ->  Sort
+         Output: t1.c1, t2.c2, t3.c3, t1.c3
+         Sort Key: t1.c3, t1.c1
+         ->  Foreign Scan
+               Output: t1.c1, t2.c2, t3.c3, t1.c3
+               Relations: ((public.ft1 t1) INNER JOIN (public.ft2 t2)) INNER JOIN (public.ft4 t3)
+               Remote SQL: SELECT l.a1, l.a2, l.a3, r.a1 FROM (SELECT l.a1, l.a2, r.a1, r.a2 FROM (SELECT l.a10, l.a12 FROM (SELECT "C 1" a10, c3 a12 FROM "S 1"."T 1") l) l (a1, a2) INNER JOIN (SELECT r.a10, r.a9 FROM (SELECT "C 1" a9, c2 a10 FROM "S 1"."T 1") r) r (a1, a2) ON ((l.a1 = r.a2))) l (a1, a2, a3, a4) INNER JOIN (SELECT r.a11, r.a9 FROM (SELECT c1 a9, c3 a11 FROM "S 1"."T 3") r) r (a1, a2) ON ((l.a1 = r.a2))
+(9 rows)
+
+SELECT t1.c1, t2.c2, t3.c3 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) JOIN ft4 t3 ON (t3.c1 = t1.c1) ORDER BY t1.c3, t1.c1 OFFSET 10 LIMIT 10;
+ c1 | c2 |   c3   
+----+----+--------
+ 22 |  2 | AAA022
+ 24 |  4 | AAA024
+ 26 |  6 | AAA026
+ 28 |  8 | AAA028
+ 30 |  0 | AAA030
+ 32 |  2 | AAA032
+ 34 |  4 | AAA034
+ 36 |  6 | AAA036
+ 38 |  8 | AAA038
+ 40 |  0 | AAA040
+(10 rows)
+
+-- left outer join
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft4 t1 LEFT JOIN ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 10 LIMIT 10;
+                                                                                              QUERY PLAN                                                                                               
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Sort
+         Output: t1.c1, t2.c1
+         Sort Key: t1.c1, t2.c1
+         ->  Foreign Scan
+               Output: t1.c1, t2.c1
+               Relations: (public.ft4 t1) LEFT JOIN (public.ft5 t2)
+               Remote SQL: SELECT l.a1, r.a1 FROM (SELECT l.a9 FROM (SELECT c1 a9 FROM "S 1"."T 3") l) l (a1) LEFT JOIN (SELECT r.a9 FROM (SELECT c1 a9 FROM "S 1"."T 4") r) r (a1) ON ((l.a1 = r.a1))
+(9 rows)
+
+SELECT t1.c1, t2.c1 FROM ft4 t1 LEFT JOIN ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 10 LIMIT 10;
+ c1 | c1 
+----+----
+ 22 |   
+ 24 | 24
+ 26 |   
+ 28 |   
+ 30 | 30
+ 32 |   
+ 34 |   
+ 36 | 36
+ 38 |   
+ 40 |   
+(10 rows)
+
+-- right outer join
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft5 t1 RIGHT JOIN ft4 t2 ON (t1.c1 = t2.c1) ORDER BY t2.c1, t1.c1 OFFSET 10 LIMIT 10;
+                                                                                              QUERY PLAN                                                                                               
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Sort
+         Output: t1.c1, t2.c1
+         Sort Key: t2.c1, t1.c1
+         ->  Foreign Scan
+               Output: t1.c1, t2.c1
+               Relations: (public.ft4 t2) LEFT JOIN (public.ft5 t1)
+               Remote SQL: SELECT l.a1, r.a1 FROM (SELECT l.a9 FROM (SELECT c1 a9 FROM "S 1"."T 3") l) l (a1) LEFT JOIN (SELECT r.a9 FROM (SELECT c1 a9 FROM "S 1"."T 4") r) r (a1) ON ((r.a1 = l.a1))
+(9 rows)
+
+SELECT t1.c1, t2.c1 FROM ft5 t1 RIGHT JOIN ft4 t2 ON (t1.c1 = t2.c1) ORDER BY t2.c1, t1.c1 OFFSET 10 LIMIT 10;
+ c1 | c1 
+----+----
+    | 22
+ 24 | 24
+    | 26
+    | 28
+ 30 | 30
+    | 32
+    | 34
+ 36 | 36
+    | 38
+    | 40
+(10 rows)
+
+-- full outer join
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft4 t1 FULL JOIN ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 45 LIMIT 10;
+                                                                                              QUERY PLAN                                                                                               
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Sort
+         Output: t1.c1, t2.c1
+         Sort Key: t1.c1, t2.c1
+         ->  Foreign Scan
+               Output: t1.c1, t2.c1
+               Relations: (public.ft4 t1) FULL JOIN (public.ft5 t2)
+               Remote SQL: SELECT l.a1, r.a1 FROM (SELECT l.a9 FROM (SELECT c1 a9 FROM "S 1"."T 3") l) l (a1) FULL JOIN (SELECT r.a9 FROM (SELECT c1 a9 FROM "S 1"."T 4") r) r (a1) ON ((l.a1 = r.a1))
+(9 rows)
+
+SELECT t1.c1, t2.c1 FROM ft4 t1 FULL JOIN ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 45 LIMIT 10;
+ c1  | c1 
+-----+----
+  92 |   
+  94 |   
+  96 | 96
+  98 |   
+ 100 |   
+     |  3
+     |  9
+     | 15
+     | 21
+     | 27
+(10 rows)
+
+-- full outer join + WHERE clause, only matched rows
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft4 t1 FULL JOIN ft5 t2 ON (t1.c1 = t2.c1) WHERE (t1.c1 = t2.c1 OR t1.c1 IS NULL) ORDER BY t1.c1, t2.c1 OFFSET 10 LIMIT 10;
+                                                                                                                   QUERY PLAN                                                                                                                    
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Sort
+         Output: t1.c1, t2.c1
+         Sort Key: t1.c1, t2.c1
+         ->  Foreign Scan
+               Output: t1.c1, t2.c1
+               Relations: (public.ft4 t1) FULL JOIN (public.ft5 t2)
+               Remote SQL: SELECT l.a1, r.a1 FROM (SELECT l.a9 FROM (SELECT c1 a9 FROM "S 1"."T 3") l) l (a1) FULL JOIN (SELECT r.a9 FROM (SELECT c1 a9 FROM "S 1"."T 4") r) r (a1) ON ((l.a1 = r.a1)) WHERE (((l.a1 = r.a1) OR (l.a1 IS NULL)))
+(9 rows)
+
+SELECT t1.c1, t2.c1 FROM ft4 t1 FULL JOIN ft5 t2 ON (t1.c1 = t2.c1) WHERE (t1.c1 = t2.c1 OR t1.c1 IS NULL) ORDER BY t1.c1, t2.c1 OFFSET 10 LIMIT 10;
+ c1 | c1 
+----+----
+ 66 | 66
+ 72 | 72
+ 78 | 78
+ 84 | 84
+ 90 | 90
+ 96 | 96
+    |  3
+    |  9
+    | 15
+    | 21
+(10 rows)
+
+-- join at WHERE clause 
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON true WHERE (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+                                                                                                               QUERY PLAN                                                                                                                
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1, t1.c3
+   ->  Sort
+         Output: t1.c1, t2.c1, t1.c3
+         Sort Key: t1.c3, t1.c1
+         ->  Foreign Scan
+               Output: t1.c1, t2.c1, t1.c3
+               Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+               Remote SQL: SELECT l.a1, l.a2, r.a1 FROM (SELECT l.a10, l.a12 FROM (SELECT "C 1" a10, c3 a12 FROM "S 1"."T 1") l) l (a1, a2) INNER JOIN (SELECT r.a9 FROM (SELECT "C 1" a9 FROM "S 1"."T 1") r) r (a1) ON ((l.a1 = r.a1))
+(9 rows)
+
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON true WHERE (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+ c1  | c1  
+-----+-----
+ 101 | 101
+ 102 | 102
+ 103 | 103
+ 104 | 104
+ 105 | 105
+ 106 | 106
+ 107 | 107
+ 108 | 108
+ 109 | 109
+ 110 | 110
+(10 rows)
+
+-- join in CTE
+EXPLAIN (COSTS false, VERBOSE)
+WITH t (c1_1, c1_3, c2_1) AS (SELECT t1.c1, t1.c3, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)) SELECT c1_1, c2_1 FROM t ORDER BY c1_3, c1_1 OFFSET 100 LIMIT 10;
+                                                                                                             QUERY PLAN                                                                                                              
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t.c1_1, t.c2_1, t.c1_3
+   CTE t
+     ->  Foreign Scan
+           Output: t1.c1, t1.c3, t2.c1
+           Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+           Remote SQL: SELECT l.a1, l.a2, r.a1 FROM (SELECT l.a10, l.a12 FROM (SELECT "C 1" a10, c3 a12 FROM "S 1"."T 1") l) l (a1, a2) INNER JOIN (SELECT r.a9 FROM (SELECT "C 1" a9 FROM "S 1"."T 1") r) r (a1) ON ((l.a1 = r.a1))
+   ->  Sort
+         Output: t.c1_1, t.c2_1, t.c1_3
+         Sort Key: t.c1_3, t.c1_1
+         ->  CTE Scan on t
+               Output: t.c1_1, t.c2_1, t.c1_3
+(12 rows)
+
+WITH t (c1_1, c1_3, c2_1) AS (SELECT t1.c1, t1.c3, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)) SELECT c1_1, c2_1 FROM t ORDER BY c1_3, c1_1 OFFSET 100 LIMIT 10;
+ c1_1 | c2_1 
+------+------
+  101 |  101
+  102 |  102
+  103 |  103
+  104 |  104
+  105 |  105
+  106 |  106
+  107 |  107
+  108 |  108
+  109 |  109
+  110 |  110
+(10 rows)
+
+-- ctid with whole-row reference
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.ctid, t1, t2, t1.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+                                                                                                                                                                                                                                                   QUERY PLAN                                                                                                                                                                                                                                                    
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.ctid, t1.*, t2.*, t1.c1, t1.c3
+   ->  Sort
+         Output: t1.ctid, t1.*, t2.*, t1.c1, t1.c3
+         Sort Key: t1.c3, t1.c1
+         ->  Foreign Scan
+               Output: t1.ctid, t1.*, t2.*, t1.c1, t1.c3
+               Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+               Remote SQL: SELECT l.a1, l.a2, l.a3, l.a4, r.a1 FROM (SELECT l.a7, ROW(l.a10, l.a11, l.a12, l.a13, l.a14, l.a15, l.a16, l.a17), l.a10, l.a12 FROM (SELECT "C 1" a10, c2 a11, c3 a12, c4 a13, c5 a14, c6 a15, c7 a16, c8 a17, ctid a7 FROM "S 1"."T 1") l) l (a1, a2, a3, a4) INNER JOIN (SELECT ROW(r.a9, r.a10, r.a12, r.a13, r.a14, r.a15, r.a16, r.a17), r.a9 FROM (SELECT "C 1" a9, c2 a10, c3 a12, c4 a13, c5 a14, c6 a15, c7 a16, c8 a17 FROM "S 1"."T 1") r) r (a1, a2) ON ((l.a3 = r.a2))
+(9 rows)
+
+SELECT t1.ctid, t1, t2, t1.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+  ctid  |                                             t1                                             |                                             t2                                             | c1  
+--------+--------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+-----
+ (1,4)  | (101,1,00101,"Fri Jan 02 00:00:00 1970 PST","Fri Jan 02 00:00:00 1970",1,"1         ",foo) | (101,1,00101,"Fri Jan 02 00:00:00 1970 PST","Fri Jan 02 00:00:00 1970",1,"1         ",foo) | 101
+ (1,5)  | (102,2,00102,"Sat Jan 03 00:00:00 1970 PST","Sat Jan 03 00:00:00 1970",2,"2         ",foo) | (102,2,00102,"Sat Jan 03 00:00:00 1970 PST","Sat Jan 03 00:00:00 1970",2,"2         ",foo) | 102
+ (1,6)  | (103,3,00103,"Sun Jan 04 00:00:00 1970 PST","Sun Jan 04 00:00:00 1970",3,"3         ",foo) | (103,3,00103,"Sun Jan 04 00:00:00 1970 PST","Sun Jan 04 00:00:00 1970",3,"3         ",foo) | 103
+ (1,7)  | (104,4,00104,"Mon Jan 05 00:00:00 1970 PST","Mon Jan 05 00:00:00 1970",4,"4         ",foo) | (104,4,00104,"Mon Jan 05 00:00:00 1970 PST","Mon Jan 05 00:00:00 1970",4,"4         ",foo) | 104
+ (1,8)  | (105,5,00105,"Tue Jan 06 00:00:00 1970 PST","Tue Jan 06 00:00:00 1970",5,"5         ",foo) | (105,5,00105,"Tue Jan 06 00:00:00 1970 PST","Tue Jan 06 00:00:00 1970",5,"5         ",foo) | 105
+ (1,9)  | (106,6,00106,"Wed Jan 07 00:00:00 1970 PST","Wed Jan 07 00:00:00 1970",6,"6         ",foo) | (106,6,00106,"Wed Jan 07 00:00:00 1970 PST","Wed Jan 07 00:00:00 1970",6,"6         ",foo) | 106
+ (1,10) | (107,7,00107,"Thu Jan 08 00:00:00 1970 PST","Thu Jan 08 00:00:00 1970",7,"7         ",foo) | (107,7,00107,"Thu Jan 08 00:00:00 1970 PST","Thu Jan 08 00:00:00 1970",7,"7         ",foo) | 107
+ (1,11) | (108,8,00108,"Fri Jan 09 00:00:00 1970 PST","Fri Jan 09 00:00:00 1970",8,"8         ",foo) | (108,8,00108,"Fri Jan 09 00:00:00 1970 PST","Fri Jan 09 00:00:00 1970",8,"8         ",foo) | 108
+ (1,12) | (109,9,00109,"Sat Jan 10 00:00:00 1970 PST","Sat Jan 10 00:00:00 1970",9,"9         ",foo) | (109,9,00109,"Sat Jan 10 00:00:00 1970 PST","Sat Jan 10 00:00:00 1970",9,"9         ",foo) | 109
+ (1,13) | (110,0,00110,"Sun Jan 11 00:00:00 1970 PST","Sun Jan 11 00:00:00 1970",0,"0         ",foo) | (110,0,00110,"Sun Jan 11 00:00:00 1970 PST","Sun Jan 11 00:00:00 1970",0,"0         ",foo) | 110
+(10 rows)
+
+-- partially unsafe to push down, not pushed down
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1 FROM ft1 t1 JOIN ft2 t2 ON t2.c1 = t2.c1 JOIN ft4 t3 ON t2.c1 = t3.c1 ORDER BY t1.c1 OFFSET 10 LIMIT 10;
+                                                                                                               QUERY PLAN                                                                                                                
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1
+   ->  Sort
+         Output: t1.c1
+         Sort Key: t1.c1
+         ->  Nested Loop
+               Output: t1.c1
+               ->  Foreign Scan on public.ft1 t1
+                     Output: t1.c1
+                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+               ->  Materialize
+                     ->  Foreign Scan
+                           Relations: (public.ft2 t2) INNER JOIN (public.ft4 t3)
+                           Remote SQL: SELECT NULL FROM (SELECT l.a9 FROM (SELECT "C 1" a9 FROM "S 1"."T 1" WHERE (("C 1" = "C 1"))) l) l (a1) INNER JOIN (SELECT r.a9 FROM (SELECT c1 a9 FROM "S 1"."T 3") r) r (a1) ON ((l.a1 = r.a1))
+(14 rows)
+
+SELECT t1.c1 FROM ft1 t1 JOIN ft2 t2 ON t2.c1 = t2.c1 JOIN ft4 t3 ON t2.c1 = t3.c1 ORDER BY t1.c1 OFFSET 10 LIMIT 10;
+ c1 
+----
+  1
+  1
+  1
+  1
+  1
+  1
+  1
+  1
+  1
+  1
+(10 rows)
+
+-- SEMI JOIN, not pushed down
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Limit
+   Output: t1.c1
+   ->  Sort
+         Output: t1.c1
+         Sort Key: t1.c1
+         ->  Hash Join
+               Output: t1.c1
+               Hash Cond: (t1.c1 = t2.c1)
+               ->  Foreign Scan on public.ft1 t1
+                     Output: t1.c1
+                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+               ->  Hash
+                     Output: t2.c1
+                     ->  HashAggregate
+                           Output: t2.c1
+                           Group Key: t2.c1
+                           ->  Foreign Scan on public.ft2 t2
+                                 Output: t2.c1
+                                 Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+(19 rows)
+
+SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
+ c1  
+-----
+ 101
+ 102
+ 103
+ 104
+ 105
+ 106
+ 107
+ 108
+ 109
+ 110
+(10 rows)
+
+-- ANTI JOIN, not pushed down
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Limit
+   Output: t1.c1
+   ->  Sort
+         Output: t1.c1
+         Sort Key: t1.c1
+         ->  Hash Anti Join
+               Output: t1.c1
+               Hash Cond: (t1.c1 = t2.c2)
+               ->  Foreign Scan on public.ft1 t1
+                     Output: t1.c1
+                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+               ->  Hash
+                     Output: t2.c2
+                     ->  Foreign Scan on public.ft2 t2
+                           Output: t2.c2
+                           Remote SQL: SELECT c2 FROM "S 1"."T 1"
+(16 rows)
+
+SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
+ c1  
+-----
+ 110
+ 111
+ 112
+ 113
+ 114
+ 115
+ 116
+ 117
+ 118
+ 119
+(10 rows)
+
+-- CROSS JOIN, not pushed down
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                             QUERY PLAN                              
+---------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Sort
+         Output: t1.c1, t2.c1
+         Sort Key: t1.c1, t2.c1
+         ->  Nested Loop
+               Output: t1.c1, t2.c1
+               ->  Foreign Scan on public.ft1 t1
+                     Output: t1.c1
+                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+               ->  Materialize
+                     Output: t2.c1
+                     ->  Foreign Scan on public.ft2 t2
+                           Output: t2.c1
+                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+(15 rows)
+
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+ c1 | c1  
+----+-----
+  1 | 101
+  1 | 102
+  1 | 103
+  1 | 104
+  1 | 105
+  1 | 106
+  1 | 107
+  1 | 108
+  1 | 109
+  1 | 110
+(10 rows)
+
+-- different server
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Merge Join
+         Output: t1.c1, t2.c1
+         Merge Cond: (t1.c1 = t2.c1)
+         ->  Sort
+               Output: t1.c1
+               Sort Key: t1.c1
+               ->  Foreign Scan on public.ft5 t1
+                     Output: t1.c1
+                     Remote SQL: SELECT c1 FROM "S 1"."T 4"
+         ->  Sort
+               Output: t2.c1
+               Sort Key: t2.c1
+               ->  Foreign Scan on public.ft6 t2
+                     Output: t2.c1
+                     Remote SQL: SELECT c1 FROM "S 1"."T 4"
+(17 rows)
+
+SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+ c1 | c1 
+----+----
+(0 rows)
+
+-- different effective user for permission check
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN v_ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Limit
+   Output: t1.c1, ft5.c1
+   ->  Merge Join
+         Output: t1.c1, ft5.c1
+         Merge Cond: (t1.c1 = ft5.c1)
+         ->  Sort
+               Output: t1.c1
+               Sort Key: t1.c1
+               ->  Foreign Scan on public.ft5 t1
+                     Output: t1.c1
+                     Remote SQL: SELECT c1 FROM "S 1"."T 4"
+         ->  Sort
+               Output: ft5.c1
+               Sort Key: ft5.c1
+               ->  Foreign Scan on public.ft5
+                     Output: ft5.c1
+                     Remote SQL: SELECT c1 FROM "S 1"."T 4"
+(17 rows)
+
+SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN v_ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+ c1 | c1 
+----+----
+(0 rows)
+
+-- unsafe join conditions
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c8 = t2.c8) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+                                 QUERY PLAN                                  
+-----------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1, t1.c3
+   ->  Sort
+         Output: t1.c1, t2.c1, t1.c3
+         Sort Key: t1.c3, t1.c1
+         ->  Merge Join
+               Output: t1.c1, t2.c1, t1.c3
+               Merge Cond: (t1.c8 = t2.c8)
+               ->  Sort
+                     Output: t1.c1, t1.c3, t1.c8
+                     Sort Key: t1.c8
+                     ->  Foreign Scan on public.ft1 t1
+                           Output: t1.c1, t1.c3, t1.c8
+                           Remote SQL: SELECT "C 1", c3, c8 FROM "S 1"."T 1"
+               ->  Sort
+                     Output: t2.c1, t2.c8
+                     Sort Key: t2.c8
+                     ->  Foreign Scan on public.ft2 t2
+                           Output: t2.c1, t2.c8
+                           Remote SQL: SELECT "C 1", c8 FROM "S 1"."T 1"
+(20 rows)
+
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c8 = t2.c8) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+ c1 | c1  
+----+-----
+  1 | 102
+  1 | 103
+  1 | 104
+  1 | 105
+  1 | 106
+  1 | 107
+  1 | 108
+  1 | 109
+  1 | 110
+  1 |   1
+(10 rows)
+
+-- local filter (unsafe conditions on one side)
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) WHERE t1.c8 = 'foo' ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+                                 QUERY PLAN                                  
+-----------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1, t1.c3
+   ->  Sort
+         Output: t1.c1, t2.c1, t1.c3
+         Sort Key: t1.c3, t1.c1
+         ->  Hash Join
+               Output: t1.c1, t2.c1, t1.c3
+               Hash Cond: (t2.c1 = t1.c1)
+               ->  Foreign Scan on public.ft2 t2
+                     Output: t2.c1
+                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+               ->  Hash
+                     Output: t1.c1, t1.c3
+                     ->  Foreign Scan on public.ft1 t1
+                           Output: t1.c1, t1.c3
+                           Filter: (t1.c8 = 'foo'::user_enum)
+                           Remote SQL: SELECT "C 1", c3, c8 FROM "S 1"."T 1"
+(17 rows)
+
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) WHERE t1.c8 = 'foo' ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+ c1  | c1  
+-----+-----
+ 101 | 101
+ 102 | 102
+ 103 | 103
+ 104 | 104
+ 105 | 105
+ 106 | 106
+ 107 | 107
+ 108 | 108
+ 109 | 109
+ 110 | 110
+(10 rows)
+
+-- Aggregate after UNION, for testing setrefs
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) UNION SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)) AS t (t1c1, t2c1) GROUP BY t1c1 ORDER BY t1c1 OFFSET 100 LIMIT 10;
+                                                                                                            QUERY PLAN                                                                                                            
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, (avg((t1.c1 + t2.c1)))
+   ->  Sort
+         Output: t1.c1, (avg((t1.c1 + t2.c1)))
+         Sort Key: t1.c1
+         ->  HashAggregate
+               Output: t1.c1, avg((t1.c1 + t2.c1))
+               Group Key: t1.c1
+               ->  HashAggregate
+                     Output: t1.c1, t2.c1
+                     Group Key: t1.c1, t2.c1
+                     ->  Append
+                           ->  Foreign Scan
+                                 Output: t1.c1, t2.c1
+                                 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+                                 Remote SQL: SELECT l.a1, r.a1 FROM (SELECT l.a10 FROM (SELECT "C 1" a10 FROM "S 1"."T 1") l) l (a1) INNER JOIN (SELECT r.a9 FROM (SELECT "C 1" a9 FROM "S 1"."T 1") r) r (a1) ON ((l.a1 = r.a1))
+                           ->  Foreign Scan
+                                 Output: t1_1.c1, t2_1.c1
+                                 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+                                 Remote SQL: SELECT l.a1, r.a1 FROM (SELECT l.a10 FROM (SELECT "C 1" a10 FROM "S 1"."T 1") l) l (a1) INNER JOIN (SELECT r.a9 FROM (SELECT "C 1" a9 FROM "S 1"."T 1") r) r (a1) ON ((l.a1 = r.a1))
+(20 rows)
+
+SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) UNION SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)) AS t (t1c1, t2c1) GROUP BY t1c1 ORDER BY t1c1 OFFSET 100 LIMIT 10;
+ t1c1 |         avg          
+------+----------------------
+  101 | 202.0000000000000000
+  102 | 204.0000000000000000
+  103 | 206.0000000000000000
+  104 | 208.0000000000000000
+  105 | 210.0000000000000000
+  106 | 212.0000000000000000
+  107 | 214.0000000000000000
+  108 | 216.0000000000000000
+  109 | 218.0000000000000000
+  110 | 220.0000000000000000
+(10 rows)
+
+-- join two foreign tables and two local tables
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 LEFT JOIN ft2 t2 ON t1.c1 = t2.c1 JOIN "S 1"."T 1" t3 ON t1.c1 = t3."C 1" JOIN "S 1"."T 2" t4 ON t1.c1 = t4.c1 ORDER BY t1.c1 OFFSET 10 LIMIT 10;
+                                                                                                     QUERY PLAN                                                                                                      
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Sort
+         Output: t1.c1, t2.c1
+         Sort Key: t1.c1
+         ->  Hash Join
+               Output: t1.c1, t2.c1
+               Hash Cond: (t1.c1 = t3."C 1")
+               ->  Foreign Scan
+                     Output: t1.c1, t2.c1
+                     Relations: (public.ft1 t1) LEFT JOIN (public.ft2 t2)
+                     Remote SQL: SELECT l.a1, r.a1 FROM (SELECT l.a10 FROM (SELECT "C 1" a10 FROM "S 1"."T 1") l) l (a1) LEFT JOIN (SELECT r.a9 FROM (SELECT "C 1" a9 FROM "S 1"."T 1") r) r (a1) ON ((l.a1 = r.a1))
+               ->  Hash
+                     Output: t3."C 1", t4.c1
+                     ->  Merge Join
+                           Output: t3."C 1", t4.c1
+                           Merge Cond: (t3."C 1" = t4.c1)
+                           ->  Index Only Scan using t1_pkey on "S 1"."T 1" t3
+                                 Output: t3."C 1"
+                           ->  Sort
+                                 Output: t4.c1
+                                 Sort Key: t4.c1
+                                 ->  Seq Scan on "S 1"."T 2" t4
+                                       Output: t4.c1
+(24 rows)
+
+SELECT t1.c1, t2.c1 FROM ft1 t1 LEFT JOIN ft2 t2 ON t1.c1 = t2.c1 JOIN "S 1"."T 1" t3 ON t1.c1 = t3."C 1" JOIN "S 1"."T 2" t4 ON t1.c1 = t4.c1 ORDER BY t1.c1 OFFSET 10 LIMIT 10;
+ c1 | c1 
+----+----
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+(10 rows)
+
+-- ===================================================================
 -- parameterized queries
 -- ===================================================================
 -- simple join
@@ -1210,22 +1908,15 @@ UPDATE ft2 SET c2 = c2 + 400, c3 = c3 || '_update7' WHERE c1 % 10 = 7 RETURNING
 EXPLAIN (verbose, costs off)
 UPDATE ft2 SET c2 = ft2.c2 + 500, c3 = ft2.c3 || '_update9', c7 = DEFAULT
   FROM ft1 WHERE ft1.c1 = ft2.c2 AND ft1.c1 % 10 = 9;
-                                                                            QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
+                                                                                                                                                                                                                                                                       QUERY PLAN                                                                                                                                                                                                                                                                       
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Update on public.ft2
    Remote SQL: UPDATE "S 1"."T 1" SET c2 = $2, c3 = $3, c7 = $4 WHERE ctid = $1
-   ->  Hash Join
+   ->  Foreign Scan
          Output: ft2.c1, (ft2.c2 + 500), NULL::integer, (ft2.c3 || '_update9'::text), ft2.c4, ft2.c5, ft2.c6, 'ft2       '::character(10), ft2.c8, ft2.ctid, ft1.*
-         Hash Cond: (ft2.c2 = ft1.c1)
-         ->  Foreign Scan on public.ft2
-               Output: ft2.c1, ft2.c2, ft2.c3, ft2.c4, ft2.c5, ft2.c6, ft2.c8, ft2.ctid
-               Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c8, ctid FROM "S 1"."T 1" FOR UPDATE
-         ->  Hash
-               Output: ft1.*, ft1.c1
-               ->  Foreign Scan on public.ft1
-                     Output: ft1.*, ft1.c1
-                     Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE ((("C 1" % 10) = 9))
-(13 rows)
+         Relations: (public.ft2) INNER JOIN (public.ft1)
+         Remote SQL: SELECT l.a1, l.a2, l.a3, l.a4, l.a5, l.a6, l.a7, l.a8, r.a1 FROM (SELECT l.a9, l.a10, l.a12, l.a13, l.a14, l.a15, l.a17, l.a7 FROM (SELECT "C 1" a9, c2 a10, c3 a12, c4 a13, c5 a14, c6 a15, c8 a17, ctid a7 FROM "S 1"."T 1" FOR UPDATE) l) l (a1, a2, a3, a4, a5, a6, a7, a8) INNER JOIN (SELECT ROW(r.a10, r.a11, r.a12, r.a13, r.a14, r.a15, r.a16, r.a17), r.a10 FROM (SELECT "C 1" a10, c2 a11, c3 a12, c4 a13, c5 a14, c6 a15, c7 a16, c8 a17 FROM "S 1"."T 1" WHERE ((("C 1" % 10) = 9))) r) r (a1, a2) ON ((l.a2 = r.a2))
+(6 rows)
 
 UPDATE ft2 SET c2 = ft2.c2 + 500, c3 = ft2.c3 || '_update9', c7 = DEFAULT
   FROM ft1 WHERE ft1.c1 = ft2.c2 AND ft1.c1 % 10 = 9;
@@ -1351,22 +2042,15 @@ DELETE FROM ft2 WHERE c1 % 10 = 5 RETURNING c1, c4;
 
 EXPLAIN (verbose, costs off)
 DELETE FROM ft2 USING ft1 WHERE ft1.c1 = ft2.c2 AND ft1.c1 % 10 = 2;
-                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
+                                                                                                                                                                                        QUERY PLAN                                                                                                                                                                                         
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Delete on public.ft2
    Remote SQL: DELETE FROM "S 1"."T 1" WHERE ctid = $1
-   ->  Hash Join
+   ->  Foreign Scan
          Output: ft2.ctid, ft1.*
-         Hash Cond: (ft2.c2 = ft1.c1)
-         ->  Foreign Scan on public.ft2
-               Output: ft2.ctid, ft2.c2
-               Remote SQL: SELECT c2, ctid FROM "S 1"."T 1" FOR UPDATE
-         ->  Hash
-               Output: ft1.*, ft1.c1
-               ->  Foreign Scan on public.ft1
-                     Output: ft1.*, ft1.c1
-                     Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE ((("C 1" % 10) = 2))
-(13 rows)
+         Relations: (public.ft2) INNER JOIN (public.ft1)
+         Remote SQL: SELECT l.a1, r.a1 FROM (SELECT l.a7, l.a10 FROM (SELECT c2 a10, ctid a7 FROM "S 1"."T 1" FOR UPDATE) l) l (a1, a2) INNER JOIN (SELECT ROW(r.a10, r.a11, r.a12, r.a13, r.a14, r.a15, r.a16, r.a17), r.a10 FROM (SELECT "C 1" a10, c2 a11, c3 a12, c4 a13, c5 a14, c6 a15, c7 a16, c8 a17 FROM "S 1"."T 1" WHERE ((("C 1" % 10) = 2))) r) r (a1, a2) ON ((l.a2 = r.a2))
+(6 rows)
 
 DELETE FROM ft2 USING ft1 WHERE ft1.c1 = ft2.c2 AND ft1.c1 % 10 = 2;
 SELECT c1,c2,c3,c4 FROM ft2 ORDER BY c1;
@@ -3641,3 +4325,6 @@ QUERY:  CREATE FOREIGN TABLE t5 (
 OPTIONS (schema_name 'import_source', table_name 't5');
 CONTEXT:  importing foreign table "t5"
 ROLLBACK;
+-- Cleanup
+DROP OWNED BY view_owner;
+DROP USER view_owner;
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index d7ae201..61d694b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -28,7 +28,6 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
-#include "optimizer/prep.h"
 #include "optimizer/restrictinfo.h"
 #include "optimizer/var.h"
 #include "parser/parsetree.h"
@@ -48,41 +47,8 @@ PG_MODULE_MAGIC;
 #define DEFAULT_FDW_TUPLE_COST		0.01
 
 /*
- * FDW-specific planner information kept in RelOptInfo.fdw_private for a
- * foreign table.  This information is collected by postgresGetForeignRelSize.
- */
-typedef struct PgFdwRelationInfo
-{
-	/* baserestrictinfo clauses, broken down into safe and unsafe subsets. */
-	List	   *remote_conds;
-	List	   *local_conds;
-
-	/* Bitmap of attr numbers we need to fetch from the remote server. */
-	Bitmapset  *attrs_used;
-
-	/* Cost and selectivity of local_conds. */
-	QualCost	local_conds_cost;
-	Selectivity local_conds_sel;
-
-	/* Estimated size and cost for a scan with baserestrictinfo quals. */
-	double		rows;
-	int			width;
-	Cost		startup_cost;
-	Cost		total_cost;
-
-	/* Options extracted from catalogs. */
-	bool		use_remote_estimate;
-	Cost		fdw_startup_cost;
-	Cost		fdw_tuple_cost;
-
-	/* Cached catalog information. */
-	ForeignTable *table;
-	ForeignServer *server;
-	UserMapping *user;			/* only set in use_remote_estimate mode */
-} PgFdwRelationInfo;
-
-/*
- * Indexes of FDW-private information stored in fdw_private lists.
+ * Indexes of FDW-private information stored in fdw_private of ForeignScan of
+ * a simple foreign table scan for a SELECT statement.
  *
  * We store various information in ForeignScan.fdw_private to pass it from
  * planner to executor.  Currently we store:
@@ -99,7 +65,13 @@ enum FdwScanPrivateIndex
 	/* SQL statement to execute remotely (as a String node) */
 	FdwScanPrivateSelectSql,
 	/* Integer list of attribute numbers retrieved by the SELECT */
-	FdwScanPrivateRetrievedAttrs
+	FdwScanPrivateRetrievedAttrs,
+	/* OID of the foreign server for the scan (as an Integer node) */
+	FdwScanPrivateServerOid,
+	/* OID of the user mapping for the scan (as an Integer node) */
+	FdwScanPrivateUserMappingOid,
+	/* Names of relations scanned; added only when the scan is a join */
+	FdwScanPrivateRelations,
 };
 
 /*
@@ -129,7 +101,8 @@ enum FdwModifyPrivateIndex
  */
 typedef struct PgFdwScanState
 {
-	Relation	rel;			/* relcache entry for the foreign table */
+	const char *relname;		/* name of relation being scanned */
+	TupleDesc	tupdesc;		/* tuple descriptor of the scan */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* extracted fdw_private data */
@@ -195,6 +168,8 @@ typedef struct PgFdwAnalyzeState
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 	List	   *retrieved_attrs;	/* attr numbers retrieved by query */
 
+	char	   *query;			/* text of SELECT command */
+
 	/* collected sample rows */
 	HeapTuple  *rows;			/* array of size targrows */
 	int			targrows;		/* target # of sample rows */
@@ -215,7 +190,10 @@ typedef struct PgFdwAnalyzeState
  */
 typedef struct ConversionLocation
 {
-	Relation	rel;			/* foreign table's relcache entry */
+	const char *relname;		/* name of relation being processed, or NULL for
+								   a foreign join */
+	const char *query;			/* query being processed */
+	TupleDesc	tupdesc;		/* tuple descriptor for attribute names */
 	AttrNumber	cur_attno;		/* attribute number being processed, or 0 */
 } ConversionLocation;
 
@@ -289,6 +267,12 @@ static bool postgresAnalyzeForeignTable(Relation relation,
 							BlockNumber *totalpages);
 static List *postgresImportForeignSchema(ImportForeignSchemaStmt *stmt,
 							Oid serverOid);
+static void postgresGetForeignJoinPaths(PlannerInfo *root,
+										RelOptInfo *joinrel,
+										RelOptInfo *outerrel,
+										RelOptInfo *innerrel,
+										JoinType jointype,
+										JoinPathExtraData *extra);
 
 /*
  * Helper functions
@@ -324,12 +308,41 @@ static void analyze_row_processor(PGresult *res, int row,
 					  PgFdwAnalyzeState *astate);
 static HeapTuple make_tuple_from_result_row(PGresult *res,
 						   int row,
-						   Relation rel,
+						   const char *relname,
+						   const char *query,
+						   TupleDesc tupdesc,
 						   AttInMetadata *attinmeta,
 						   List *retrieved_attrs,
 						   MemoryContext temp_context);
 static void conversion_error_callback(void *arg);
+static Path *get_unsorted_unparameterized_path(List *paths);
+
+/*
+ * Describe a Bitmapset as a comma-separated integer list.
+ * For debugging purposes.
+ * XXX Can this become a member of bitmapset.c?
+ */
+static char *
+bms_to_str(Bitmapset *bmp)
+{
+	StringInfoData buf;
+	bool		first = true;
+	int			x;
+
+	initStringInfo(&buf);
+
+	x = -1;
+	while ((x = bms_next_member(bmp, x)) >= 0)
+	{
+		if (!first)
+			appendStringInfoString(&buf, ", ");
+		appendStringInfo(&buf, "%d", x);
+
+		first = false;
+	}
 
+	return buf.data;
+}
 
 /*
  * Foreign-data wrapper handler function: return a struct with pointers
@@ -369,6 +382,9 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	routine->ImportForeignSchema = postgresImportForeignSchema;
 
+	/* Support functions for join push-down */
+	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -394,9 +410,13 @@ postgresGetForeignRelSize(PlannerInfo *root,
 	fpinfo = (PgFdwRelationInfo *) palloc0(sizeof(PgFdwRelationInfo));
 	baserel->fdw_private = (void *) fpinfo;
 
+	/* This scan can be pushed down to the remote. */
+	fpinfo->pushdown_safe = true;
+
 	/* Look up foreign-table catalog info. */
 	fpinfo->table = GetForeignTable(foreigntableid);
 	fpinfo->server = GetForeignServer(fpinfo->table->serverid);
+	fpinfo->umid = baserel->umid;
 
 	/*
 	 * Extract user-settable option values.  Note that per-table setting of
@@ -429,22 +449,6 @@ postgresGetForeignRelSize(PlannerInfo *root,
 	}
 
 	/*
-	 * If the table or the server is configured to use remote estimates,
-	 * identify which user to do remote access as during planning.  This
-	 * should match what ExecCheckRTEPerms() does.  If we fail due to lack of
-	 * permissions, the query would have failed at runtime anyway.
-	 */
-	if (fpinfo->use_remote_estimate)
-	{
-		RangeTblEntry *rte = planner_rt_fetch(baserel->relid, root);
-		Oid			userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
-
-		fpinfo->user = GetUserMapping(userid, fpinfo->server->serverid);
-	}
-	else
-		fpinfo->user = NULL;
-
-	/*
 	 * Identify which baserestrictinfo clauses can be sent to the remote
 	 * server and which can't.
 	 */
@@ -756,6 +760,8 @@ postgresGetForeignPlan(PlannerInfo *root,
 	List	   *retrieved_attrs;
 	StringInfoData sql;
 	ListCell   *lc;
+	List	   *fdw_scan_tlist = NIL;
+	StringInfoData relations;
 
 	/*
 	 * Separate the scan_clauses into those that can be executed remotely and
@@ -804,71 +810,27 @@ postgresGetForeignPlan(PlannerInfo *root,
 
 	/*
 	 * Build the query string to be sent for execution, and identify
-	 * expressions to be sent as parameters.
+	 * expressions to be sent as parameters.  If the relation to scan is a join
+	 * relation, receive the relations string built by deparseSelectSql.
 	 */
 	initStringInfo(&sql);
-	deparseSelectSql(&sql, root, baserel, fpinfo->attrs_used,
-					 &retrieved_attrs);
-	if (remote_conds)
-		appendWhereClause(&sql, root, baserel, remote_conds,
-						  true, &params_list);
-
-	/*
-	 * Add FOR UPDATE/SHARE if appropriate.  We apply locking during the
-	 * initial row fetch, rather than later on as is done for local tables.
-	 * The extra roundtrips involved in trying to duplicate the local
-	 * semantics exactly don't seem worthwhile (see also comments for
-	 * RowMarkType).
-	 *
-	 * Note: because we actually run the query as a cursor, this assumes that
-	 * DECLARE CURSOR ... FOR UPDATE is supported, which it isn't before 8.3.
-	 */
-	if (baserel->relid == root->parse->resultRelation &&
-		(root->parse->commandType == CMD_UPDATE ||
-		 root->parse->commandType == CMD_DELETE))
-	{
-		/* Relation is UPDATE/DELETE target, so use FOR UPDATE */
-		appendStringInfoString(&sql, " FOR UPDATE");
-	}
-	else
-	{
-		PlanRowMark *rc = get_plan_rowmark(root->rowMarks, baserel->relid);
-
-		if (rc)
-		{
-			/*
-			 * Relation is specified as a FOR UPDATE/SHARE target, so handle
-			 * that.  (But we could also see LCS_NONE, meaning this isn't a
-			 * target relation after all.)
-			 *
-			 * For now, just ignore any [NO] KEY specification, since (a) it's
-			 * not clear what that means for a remote table that we don't have
-			 * complete information about, and (b) it wouldn't work anyway on
-			 * older remote servers.  Likewise, we don't worry about NOWAIT.
-			 */
-			switch (rc->strength)
-			{
-				case LCS_NONE:
-					/* No locking needed */
-					break;
-				case LCS_FORKEYSHARE:
-				case LCS_FORSHARE:
-					appendStringInfoString(&sql, " FOR SHARE");
-					break;
-				case LCS_FORNOKEYUPDATE:
-				case LCS_FORUPDATE:
-					appendStringInfoString(&sql, " FOR UPDATE");
-					break;
-			}
-		}
-	}
+	if (baserel->reloptkind == RELOPT_JOINREL)
+		initStringInfo(&relations);
+	deparseSelectSql(&sql, root, baserel, fpinfo->attrs_used, remote_conds,
+					 &params_list, &fdw_scan_tlist, &retrieved_attrs,
+					 baserel->reloptkind == RELOPT_JOINREL ? &relations : NULL,
+					 false);
 
 	/*
-	 * Build the fdw_private list that will be available to the executor.
+	 * Build the fdw_private list that will be available in the executor.
 	 * Items in the list must match enum FdwScanPrivateIndex, above.
 	 */
-	fdw_private = list_make2(makeString(sql.data),
-							 retrieved_attrs);
+	fdw_private = list_make4(makeString(sql.data),
+							 retrieved_attrs,
+							 makeInteger(fpinfo->server->serverid),
+							 makeInteger(fpinfo->umid));
+	if (baserel->reloptkind == RELOPT_JOINREL)
+		fdw_private = lappend(fdw_private, makeString(relations.data));
 
 	/*
 	 * Create the ForeignScan node from target list, local filtering
@@ -883,7 +845,7 @@ postgresGetForeignPlan(PlannerInfo *root,
 							scan_relid,
 							params_list,
 							fdw_private,
-							NIL,	/* no custom tlist */
+							fdw_scan_tlist,
 							remote_exprs);
 }
 
@@ -897,9 +859,8 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	ForeignScan *fsplan = (ForeignScan *) node->ss.ps.plan;
 	EState	   *estate = node->ss.ps.state;
 	PgFdwScanState *fsstate;
-	RangeTblEntry *rte;
-	Oid			userid;
-	ForeignTable *table;
+	Oid			serverid;
+	Oid			umid;
 	ForeignServer *server;
 	UserMapping *user;
 	int			numParams;
@@ -919,22 +880,13 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	node->fdw_state = (void *) fsstate;
 
 	/*
-	 * Identify which user to do the remote access as.  This should match what
-	 * ExecCheckRTEPerms() does.
-	 */
-	rte = rt_fetch(fsplan->scan.scanrelid, estate->es_range_table);
-	userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
-
-	/* Get info about foreign table. */
-	fsstate->rel = node->ss.ss_currentRelation;
-	table = GetForeignTable(RelationGetRelid(fsstate->rel));
-	server = GetForeignServer(table->serverid);
-	user = GetUserMapping(userid, server->serverid);
-
-	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
+	serverid = intVal(list_nth(fsplan->fdw_private, FdwScanPrivateServerOid));
+	umid = intVal(list_nth(fsplan->fdw_private, FdwScanPrivateUserMappingOid));
+	server = GetForeignServer(serverid);
+	user = GetUserMappingById(umid);
 	fsstate->conn = GetConnection(server, user, false);
 
 	/* Assign a unique ID for my cursor */
@@ -959,8 +911,18 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 											  ALLOCSET_SMALL_INITSIZE,
 											  ALLOCSET_SMALL_MAXSIZE);
 
-	/* Get info we'll need for input data conversion. */
-	fsstate->attinmeta = TupleDescGetAttInMetadata(RelationGetDescr(fsstate->rel));
+	/* Get info we'll need for input data conversion and error report. */
+	if (fsplan->scan.scanrelid > 0)
+	{
+		fsstate->relname = RelationGetRelationName(node->ss.ss_currentRelation);
+		fsstate->tupdesc = RelationGetDescr(node->ss.ss_currentRelation);
+	}
+	else
+	{
+		fsstate->relname = NULL;
+		fsstate->tupdesc = node->ss.ss_ScanTupleSlot->tts_tupleDescriptor;
+	}
+	fsstate->attinmeta = TupleDescGetAttInMetadata(fsstate->tupdesc);
 
 	/* Prepare for output conversion of parameters used in remote query. */
 	numParams = list_length(fsplan->fdw_exprs);
@@ -1689,10 +1651,25 @@ postgresExplainForeignScan(ForeignScanState *node, ExplainState *es)
 {
 	List	   *fdw_private;
 	char	   *sql;
+	char	   *relations;
+
+	fdw_private = ((ForeignScan *) node->ss.ps.plan)->fdw_private;
 
+	/*
+	 * Add the names of the relations handled by the foreign scan when the
+	 * scan is a join.
+	 */
+	if (list_length(fdw_private) > FdwScanPrivateRelations)
+	{
+		relations = strVal(list_nth(fdw_private, FdwScanPrivateRelations));
+		ExplainPropertyText("Relations", relations, es);
+	}
+
+	/*
+	 * Add the remote query when the VERBOSE option is specified.
+	 */
 	if (es->verbose)
 	{
-		fdw_private = ((ForeignScan *) node->ss.ps.plan)->fdw_private;
 		sql = strVal(list_nth(fdw_private, FdwScanPrivateSelectSql));
 		ExplainPropertyText("Remote SQL", sql, es);
 	}
@@ -1751,10 +1728,12 @@ estimate_path_cost_size(PlannerInfo *root,
 	 */
 	if (fpinfo->use_remote_estimate)
 	{
+		List	   *remote_conds;
 		List	   *remote_join_conds;
 		List	   *local_join_conds;
 		StringInfoData sql;
 		List	   *retrieved_attrs;
+		UserMapping *user;
 		PGconn	   *conn;
 		Selectivity local_sel;
 		QualCost	local_cost;
@@ -1766,24 +1745,24 @@ estimate_path_cost_size(PlannerInfo *root,
 		classifyConditions(root, baserel, join_conds,
 						   &remote_join_conds, &local_join_conds);
 
+		remote_conds = copyObject(fpinfo->remote_conds);
+		remote_conds = list_concat(remote_conds, remote_join_conds);
+
 		/*
 		 * Construct EXPLAIN query including the desired SELECT, FROM, and
 		 * WHERE clauses.  Params and other-relation Vars are replaced by
 		 * dummy values.
+		 * We pass NULL for params_list and fdw_scan_tlist because they are
+		 * unnecessary for EXPLAIN.
 		 */
 		initStringInfo(&sql);
 		appendStringInfoString(&sql, "EXPLAIN ");
-		deparseSelectSql(&sql, root, baserel, fpinfo->attrs_used,
-						 &retrieved_attrs);
-		if (fpinfo->remote_conds)
-			appendWhereClause(&sql, root, baserel, fpinfo->remote_conds,
-							  true, NULL);
-		if (remote_join_conds)
-			appendWhereClause(&sql, root, baserel, remote_join_conds,
-							  (fpinfo->remote_conds == NIL), NULL);
+		deparseSelectSql(&sql, root, baserel, fpinfo->attrs_used, remote_conds,
+						 NULL, NULL, &retrieved_attrs, NULL, false);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->server, fpinfo->user, false);
+		user = GetUserMappingById(fpinfo->umid);
+		conn = GetConnection(fpinfo->server, user, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2080,7 +2059,9 @@ fetch_more_data(ForeignScanState *node)
 		{
 			fsstate->tuples[i] =
 				make_tuple_from_result_row(res, i,
-										   fsstate->rel,
+										   fsstate->relname,
+										   fsstate->query,
+										   fsstate->tupdesc,
 										   fsstate->attinmeta,
 										   fsstate->retrieved_attrs,
 										   fsstate->temp_cxt);
@@ -2298,7 +2279,9 @@ store_returning_result(PgFdwModifyState *fmstate,
 		HeapTuple	newtup;
 
 		newtup = make_tuple_from_result_row(res, 0,
-											fmstate->rel,
+										RelationGetRelationName(fmstate->rel),
+											fmstate->query,
+											RelationGetDescr(fmstate->rel),
 											fmstate->attinmeta,
 											fmstate->retrieved_attrs,
 											fmstate->temp_cxt);
@@ -2448,6 +2431,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	initStringInfo(&sql);
 	appendStringInfo(&sql, "DECLARE c%u CURSOR FOR ", cursor_number);
 	deparseAnalyzeSql(&sql, relation, &astate.retrieved_attrs);
+	astate.query = sql.data;
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
@@ -2589,7 +2573,9 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
 		oldcontext = MemoryContextSwitchTo(astate->anl_cxt);
 
 		astate->rows[pos] = make_tuple_from_result_row(res, row,
-													   astate->rel,
+										   RelationGetRelationName(astate->rel),
+													   astate->query,
+											   RelationGetDescr(astate->rel),
 													   astate->attinmeta,
 													 astate->retrieved_attrs,
 													   astate->temp_cxt);
@@ -2863,6 +2849,331 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 }
 
 /*
+ * Construct PgFdwRelationInfo from two join sources
+ */
+static void
+merge_fpinfo(RelOptInfo *outerrel,
+			 RelOptInfo *innerrel,
+			 PgFdwRelationInfo *fpinfo,
+			 JoinType jointype,
+			 double rows,
+			 int width)
+{
+	PgFdwRelationInfo *fpinfo_o;
+	PgFdwRelationInfo *fpinfo_i;
+
+	fpinfo_o = (PgFdwRelationInfo *) outerrel->fdw_private;
+	fpinfo_i = (PgFdwRelationInfo *) innerrel->fdw_private;
+
+	/* Mark that this join can be pushed down safely */
+	fpinfo->pushdown_safe = true;
+
+	/* The join relation inherits the conditions of its source relations */
+	fpinfo->remote_conds = list_concat(copyObject(fpinfo_o->remote_conds),
+									   copyObject(fpinfo_i->remote_conds));
+	fpinfo->local_conds = list_concat(copyObject(fpinfo_o->local_conds),
+									  copyObject(fpinfo_i->local_conds));
+
+	/* Only for simple foreign table scan */
+	fpinfo->attrs_used = NULL;
+
+	/* rows and width are supplied by the caller */
+	fpinfo->rows = rows;
+	fpinfo->width = width;
+
+	/* Sum up the local condition costs of outer and inner. */
+	fpinfo->local_conds_cost.startup = fpinfo_o->local_conds_cost.startup +
+									   fpinfo_i->local_conds_cost.startup;
+	fpinfo->local_conds_cost.per_tuple = fpinfo_o->local_conds_cost.per_tuple +
+										 fpinfo_i->local_conds_cost.per_tuple;
+
+	/* Don't consider correlation between local filters. */
+	fpinfo->local_conds_sel = fpinfo_o->local_conds_sel *
+							  fpinfo_i->local_conds_sel;
+
+	fpinfo->use_remote_estimate = false;
+
+	/*
+	 * These two come from the default or per-server settings, so outer and
+	 * inner must have the same value.
+	 */
+	fpinfo->fdw_startup_cost = fpinfo_o->fdw_startup_cost;
+	fpinfo->fdw_tuple_cost = fpinfo_o->fdw_tuple_cost;
+
+	/*
+	 * TODO estimate more accurately
+	 */
+	fpinfo->startup_cost = fpinfo->fdw_startup_cost +
+						   fpinfo->local_conds_cost.startup;
+	fpinfo->total_cost = fpinfo->startup_cost +
+						 (fpinfo->fdw_tuple_cost +
+						  fpinfo->local_conds_cost.per_tuple +
+						  cpu_tuple_cost) * fpinfo->rows;
+
+	/* The server and user mapping are the same for outer and inner */
+	fpinfo->server = fpinfo_o->server;
+	fpinfo->umid = fpinfo_o->umid;
+
+	fpinfo->outerrel = outerrel;
+	fpinfo->innerrel = innerrel;
+	fpinfo->jointype = jointype;
+
+	/* This join can be pushed down safely */
+	fpinfo->pushdown_safe = true;
+
+	/* joinclauses and otherclauses will be set later */
+}
+
+/*
+ * Get a copy of an unsorted, unparameterized path
+ */
+static Path *
+get_unsorted_unparameterized_path(List *paths)
+{
+	ListCell   *l;
+
+	foreach(l, paths)
+	{
+		Path	   *path = (Path *) lfirst(l);
+
+		if (path->pathkeys == NIL && path->param_info == NULL)
+		{
+			switch (path->pathtype)
+			{
+				case T_MergeJoin:
+					{
+						MergePath  *retval = makeNode(MergePath);
+						*retval = *((MergePath *) path);
+						return (Path *) retval;
+					}
+				case T_HashJoin:
+					{
+						HashPath   *retval = makeNode(HashPath);
+						*retval = *((HashPath *) path);
+						return (Path *) retval;
+					}
+				case T_NestLoop:
+					{
+						NestPath   *retval = makeNode(NestPath);
+						*retval = *((NestPath *) path);
+						return (Path *) retval;
+					}
+				default:
+					elog(ERROR, "unrecognized node type: %d",
+						 (int) path->pathtype);
+					return NULL;
+			}
+		}
+	}
+	return NULL;
+}
+
+/*
+ * postgresGetForeignJoinPaths
+ *		Add possible ForeignPath to joinrel.
+ *
+ * Joins that satisfy the conditions below can be pushed down to the remote
+ * PostgreSQL server.
+ *
+ * 1) Join type is INNER or OUTER (one of LEFT/RIGHT/FULL)
+ * 2) Both outer and inner portions are safe to push down
+ * 3) All foreign tables in the join belong to the same foreign server
+ * 4) All join conditions are safe to push down
+ * 5) No relation has a local filter (this can be relaxed for INNER JOIN with
+ * no volatile function/operator, but for now we take the safer approach)
+ */
+static void
+postgresGetForeignJoinPaths(PlannerInfo *root,
+							RelOptInfo *joinrel,
+							RelOptInfo *outerrel,
+							RelOptInfo *innerrel,
+							JoinType jointype,
+							JoinPathExtraData *extra)
+{
+	PgFdwRelationInfo *fpinfo;
+	PgFdwRelationInfo *fpinfo_o;
+	PgFdwRelationInfo *fpinfo_i;
+	ForeignPath	   *joinpath;
+	double			rows;
+	Cost			startup_cost;
+	Cost			total_cost;
+	Path		   *subpath;
+
+	ListCell	   *lc;
+	List		   *joinclauses;
+	List		   *otherclauses;
+
+	/*
+	 * Skip if this join combination has been considered already.
+	 */
+	if (joinrel->fdw_private)
+	{
+		ereport(DEBUG3, (errmsg("combination already considered")));
+		return;
+	}
+
+	/*
+	 * Create an unfinished PgFdwRelationInfo entry, which indicates that the
+	 * join relation has already been considered but cannot be pushed down.
+	 * Once we know that this join can be pushed down, we fill in the entry
+	 * and make it valid by calling merge_fpinfo.
+	 *
+	 * This unfinished entry prevents redundant checks for a join combination
+	 * that is already known to be unsafe to push down.
+	 */
+	fpinfo = (PgFdwRelationInfo *) palloc0(sizeof(PgFdwRelationInfo));
+	fpinfo->pushdown_safe = false;
+	joinrel->fdw_private = fpinfo;
+
+	/*
+	 * We support all outer joins in addition to inner join.  CROSS JOIN is
+	 * an INNER JOIN with no conditions internally, so will be checked later.
+	 */
+	if (jointype != JOIN_INNER && jointype != JOIN_LEFT &&
+		jointype != JOIN_RIGHT && jointype != JOIN_FULL)
+	{
+		ereport(DEBUG3, (errmsg("unsupported join type (SEMI, ANTI)")));
+		return;
+	}
+
+	/*
+	 * A valid PgFdwRelationInfo marked as "safe to push down" in
+	 * RelOptInfo.fdw_private indicates that the scan of that relation can be
+	 * pushed down.  If either side has no PgFdwRelationInfo, or it is not
+	 * marked as safe, give up on pushing down this join relation.
+	 */
+	fpinfo_o = (PgFdwRelationInfo *) outerrel->fdw_private;
+	if (!fpinfo_o || !fpinfo_o->pushdown_safe)
+	{
+		ereport(DEBUG3, (errmsg("outer is not safe to push-down")));
+		return;
+	}
+	fpinfo_i = (PgFdwRelationInfo *) innerrel->fdw_private;
+	if (!fpinfo_i || !fpinfo_i->pushdown_safe)
+	{
+		ereport(DEBUG3, (errmsg("inner is not safe to push-down")));
+		return;
+	}
+
+	/*
+	 * All relations in the join must belong to the same server.  A valid
+	 * fdw_private means that all relations under it belong to the server
+	 * recorded in it, so we only need to compare the serverids of the outer
+	 * and inner relations.
+	 */
+	if (fpinfo_o->server->serverid != fpinfo_i->server->serverid)
+	{
+		ereport(DEBUG3, (errmsg("server unmatch")));
+		return;
+	}
+
+	/*
+	 * No source relation can have local conditions.  This can be relaxed
+	 * if the join is an inner join and the local conditions contain no
+	 * volatile functions or operators, but for now we leave that as a future
+	 * enhancement.
+	 */
+	if (fpinfo_o->local_conds != NULL || fpinfo_i->local_conds != NULL)
+	{
+		ereport(DEBUG3, (errmsg("join with local filter")));
+		return;
+	}
+
+	/*
+	 * Separate restrictlist into two lists, join conditions and remote filters.
+	 */
+	joinclauses = extra->restrictlist;
+	if (IS_OUTER_JOIN(jointype))
+	{
+		extract_actual_join_clauses(joinclauses, &joinclauses, &otherclauses);
+	}
+	else
+	{
+		joinclauses = extract_actual_clauses(joinclauses, false);
+		otherclauses = NIL;
+	}
+
+	/*
+	 * Note that CROSS JOIN (cartesian product) is transformed to JOIN_INNER
+	 * with empty joinclauses.  Pushing down a CROSS JOIN usually produces a
+	 * larger result than retrieving each table separately, so we don't push
+	 * down such joins.
+	 */
+	if (jointype == JOIN_INNER && joinclauses == NIL)
+	{
+		ereport(DEBUG3, (errmsg("unsupported join type (CROSS)")));
+		return;
+	}
+
+	/*
+	 * Join conditions must be safe to push down.
+	 */
+	foreach(lc, joinclauses)
+	{
+		Expr *expr = (Expr *) lfirst(lc);
+
+		if (!is_foreign_expr(root, joinrel, expr))
+		{
+			ereport(DEBUG3, (errmsg("join quals contains unsafe conditions")));
+			return;
+		}
+	}
+
+	/*
+	 * Other conditions for the join must be safe to push down.
+	 */
+	foreach(lc, otherclauses)
+	{
+		Expr *expr = (Expr *) lfirst(lc);
+
+		if (!is_foreign_expr(root, joinrel, expr))
+		{
+			ereport(DEBUG3, (errmsg("filter contains unsafe conditions")));
+			return;
+		}
+	}
+
+	/* Here we know that this join can be pushed down to the remote side. */
+
+	/* Construct fpinfo for the join relation */
+	merge_fpinfo(outerrel, innerrel, fpinfo, jointype, joinrel->rows,
+				 joinrel->width);
+	fpinfo->joinclauses = joinclauses;
+	fpinfo->otherclauses = otherclauses;
+
+	/* TODO determine more accurate cost and rows of the join. */
+	rows = joinrel->rows;
+	startup_cost = fpinfo->startup_cost;
+	total_cost = fpinfo->total_cost;
+
+	/* Get an alternative path for this foreign join */
+	subpath = get_unsorted_unparameterized_path(joinrel->pathlist);
+	if (subpath == NULL)
+		elog(ERROR, "could not get any alternative path for a foreign join");
+
+	/*
+	 * Create a new join path and add it to the joinrel which represents a join
+	 * between foreign tables.
+	 */
+	joinpath = create_foreignscan_path(root,
+									   joinrel,
+									   rows,
+									   startup_cost,
+									   total_cost,
+									   NIL,		/* no pathkeys */
+									   NULL,	/* no required_outer */
+									   subpath,
+									   NIL);	/* no fdw_private */
+
+	/* Add generated path into joinrel by add_path(). */
+	add_path(joinrel, (Path *) joinpath);
+	elog(DEBUG3, "join path added for (%s) join (%s)",
+		 bms_to_str(outerrel->relids), bms_to_str(innerrel->relids));
+
+	/* TODO consider parameterized paths */
+}
+
+/*
  * Create a tuple from the specified row of the PGresult.
  *
  * rel is the local representation of the foreign table, attinmeta is
@@ -2873,13 +3184,14 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 static HeapTuple
 make_tuple_from_result_row(PGresult *res,
 						   int row,
-						   Relation rel,
+						   const char *relname,
+						   const char *query,
+						   TupleDesc tupdesc,
 						   AttInMetadata *attinmeta,
 						   List *retrieved_attrs,
 						   MemoryContext temp_context)
 {
 	HeapTuple	tuple;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
 	Datum	   *values;
 	bool	   *nulls;
 	ItemPointer ctid = NULL;
@@ -2906,7 +3218,9 @@ make_tuple_from_result_row(PGresult *res,
 	/*
 	 * Set up and install callback to report where conversion error occurs.
 	 */
-	errpos.rel = rel;
+	errpos.relname = relname;
+	errpos.query = query;
+	errpos.tupdesc = tupdesc;
 	errpos.cur_attno = 0;
 	errcallback.callback = conversion_error_callback;
 	errcallback.arg = (void *) &errpos;
@@ -2996,11 +3310,39 @@ make_tuple_from_result_row(PGresult *res,
 static void
 conversion_error_callback(void *arg)
 {
+	const char *attname;
+	const char *relname;
 	ConversionLocation *errpos = (ConversionLocation *) arg;
-	TupleDesc	tupdesc = RelationGetDescr(errpos->rel);
+	TupleDesc	tupdesc = errpos->tupdesc;
+	StringInfoData buf;
+
+	if (errpos->relname)
+	{
+		/* error occurred in a scan against a foreign table */
+		initStringInfo(&buf);
+		if (errpos->cur_attno > 0)
+			appendStringInfo(&buf, "column \"%s\"",
+					 NameStr(tupdesc->attrs[errpos->cur_attno - 1]->attname));
+		else if (errpos->cur_attno == SelfItemPointerAttributeNumber)
+			appendStringInfoString(&buf, "column \"ctid\"");
+		attname = buf.data;
+
+		initStringInfo(&buf);
+		appendStringInfo(&buf, "foreign table \"%s\"", errpos->relname);
+		relname = buf.data;
+	}
+	else
+	{
+		/* error occurred in a scan against a foreign join */ 
+		initStringInfo(&buf);
+		appendStringInfo(&buf, "column %d", errpos->cur_attno - 1);
+		attname = buf.data;
+
+		initStringInfo(&buf);
+		appendStringInfo(&buf, "foreign join \"%s\"", errpos->query);
+		relname = buf.data;
+	}
 
 	if (errpos->cur_attno > 0 && errpos->cur_attno <= tupdesc->natts)
-		errcontext("column \"%s\" of foreign table \"%s\"",
-				   NameStr(tupdesc->attrs[errpos->cur_attno - 1]->attname),
-				   RelationGetRelationName(errpos->rel));
+		errcontext("%s of %s", attname, relname);
 }
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 3835ddb..82ce480 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,10 +16,59 @@
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
+#include "nodes/plannodes.h"
 #include "utils/relcache.h"
 
 #include "libpq-fe.h"
 
+/*
+ * FDW-specific planner information kept in RelOptInfo.fdw_private for a
+ * foreign table or a foreign join.  This information is collected by
+ * postgresGetForeignRelSize, or calculated from join source relations.
+ */
+typedef struct PgFdwRelationInfo
+{
+	/*
+	 * True means that the relation can be pushed down.  Always true for
+	 * simple foreign scan.
+	 */
+	bool		pushdown_safe;
+
+	/* baserestrictinfo clauses, broken down into safe and unsafe subsets. */
+	List	   *remote_conds;
+	List	   *local_conds;
+
+	/* Bitmap of attr numbers we need to fetch from the remote server. */
+	Bitmapset  *attrs_used;
+
+	/* Cost and selectivity of local_conds. */
+	QualCost	local_conds_cost;
+	Selectivity local_conds_sel;
+
+	/* Estimated size and cost for a scan with baserestrictinfo quals. */
+	double		rows;
+	int			width;
+	Cost		startup_cost;
+	Cost		total_cost;
+
+	/* Options extracted from catalogs. */
+	bool		use_remote_estimate;
+	Cost		fdw_startup_cost;
+	Cost		fdw_tuple_cost;
+
+	/* Cached catalog information. */
+	ForeignTable *table;
+	ForeignServer *server;
+	Oid			umid;
+
+	/* Join information */
+	RelOptInfo *outerrel;
+	RelOptInfo *innerrel;
+	JoinType	jointype;
+	List	   *joinclauses;
+	List	   *otherclauses;
+} PgFdwRelationInfo;
+
 /* in postgres_fdw.c */
 extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
@@ -51,13 +100,32 @@ extern void deparseSelectSql(StringInfo buf,
 				 PlannerInfo *root,
 				 RelOptInfo *baserel,
 				 Bitmapset *attrs_used,
-				 List **retrieved_attrs);
-extern void appendWhereClause(StringInfo buf,
+				 List *remote_conds,
+				 List **params_list,
+				 List **fdw_scan_tlist,
+				 List **retrieved_attrs,
+				 StringInfo relations,
+				 bool alias);
+extern void appendConditions(StringInfo buf,
 				  PlannerInfo *root,
 				  RelOptInfo *baserel,
+				  List *outertlist,
+				  List *innertlist,
 				  List *exprs,
-				  bool is_first,
+				  const char *prefix,
 				  List **params);
+extern void deparseJoinSql(StringInfo sql,
+			   PlannerInfo *root,
+			   RelOptInfo *baserel,
+			   RelOptInfo *outerrel,
+			   RelOptInfo *innerrel,
+			   const char *sql_o,
+			   const char *sql_i,
+			   JoinType jointype,
+			   List *joinclauses,
+			   List *otherclauses,
+			   List **fdw_scan_tlist,
+			   List **retrieved_attrs);
 extern void deparseInsertSql(StringInfo buf, PlannerInfo *root,
 				 Index rtindex, Relation rel,
 				 List *targetAttrs, bool doNothing, List *returningList,
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index fcdd92e..9429d34 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -11,12 +11,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -39,6 +44,18 @@ CREATE TABLE "S 1"."T 2" (
 	c2 text,
 	CONSTRAINT t2_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 3" (
+	c1 int NOT NULL,
+	c2 int NOT NULL,
+	c3 text,
+	CONSTRAINT t3_pkey PRIMARY KEY (c1)
+);
+CREATE TABLE "S 1"."T 4" (
+	c1 int NOT NULL,
+	c2 int NOT NULL,
+	c4 text,
+	CONSTRAINT t4_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -54,9 +71,23 @@ INSERT INTO "S 1"."T 2"
 	SELECT id,
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
+INSERT INTO "S 1"."T 3"
+	SELECT id,
+	       id + 1,
+	       'AAA' || to_char(id, 'FM000')
+	FROM generate_series(1, 100) id;
+DELETE FROM "S 1"."T 3" WHERE c1 % 2 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 4"
+	SELECT id,
+	       id + 1,
+	       'AAA' || to_char(id, 'FM000')
+	FROM generate_series(1, 100) id;
+DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
 
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
+ANALYZE "S 1"."T 3";
+ANALYZE "S 1"."T 4";
 
 -- ===================================================================
 -- create foreign tables
@@ -87,6 +118,29 @@ CREATE FOREIGN TABLE ft2 (
 ) SERVER loopback;
 ALTER FOREIGN TABLE ft2 DROP COLUMN cx;
 
+CREATE FOREIGN TABLE ft4 (
+	c1 int NOT NULL,
+	c2 int NOT NULL,
+	c3 text
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 3');
+
+CREATE FOREIGN TABLE ft5 (
+	c1 int NOT NULL,
+	c2 int NOT NULL,
+	c3 text
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 4');
+
+CREATE FOREIGN TABLE ft6 (
+	c1 int NOT NULL,
+	c2 int NOT NULL,
+	c3 text
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE USER view_owner;
+GRANT ALL ON ft5 TO view_owner;
+CREATE VIEW v_ft5 AS SELECT * FROM ft5;
+ALTER VIEW v_ft5 OWNER TO view_owner;
+CREATE USER MAPPING FOR view_owner SERVER loopback;
+
 -- ===================================================================
 -- tests for validator
 -- ===================================================================
@@ -158,8 +212,6 @@ EXPLAIN (VERBOSE, COSTS false) SELECT * FROM ft1 t1 WHERE c1 = 102 FOR SHARE;
 SELECT * FROM ft1 t1 WHERE c1 = 102 FOR SHARE;
 -- aggregate
 SELECT COUNT(*) FROM ft1 t1;
--- join two tables
-SELECT t1.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
 -- subquery
 SELECT * FROM ft1 t1 WHERE t1.c3 IN (SELECT c3 FROM ft2 t2 WHERE c1 <= 10) ORDER BY c1;
 -- subquery+MAX
@@ -216,6 +268,86 @@ SELECT * FROM ft1 WHERE c1 = ANY (ARRAY(SELECT c1 FROM ft2 WHERE c1 < 5));
 SELECT * FROM ft2 WHERE c1 = ANY (ARRAY(SELECT c1 FROM ft1 WHERE c1 < 5));
 
 -- ===================================================================
+-- JOIN queries
+-- ===================================================================
+-- join two tables
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+-- join three tables
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c2, t3.c3 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) JOIN ft4 t3 ON (t3.c1 = t1.c1) ORDER BY t1.c3, t1.c1 OFFSET 10 LIMIT 10;
+SELECT t1.c1, t2.c2, t3.c3 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) JOIN ft4 t3 ON (t3.c1 = t1.c1) ORDER BY t1.c3, t1.c1 OFFSET 10 LIMIT 10;
+-- left outer join
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft4 t1 LEFT JOIN ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 10 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft4 t1 LEFT JOIN ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 10 LIMIT 10;
+-- right outer join
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft5 t1 RIGHT JOIN ft4 t2 ON (t1.c1 = t2.c1) ORDER BY t2.c1, t1.c1 OFFSET 10 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft5 t1 RIGHT JOIN ft4 t2 ON (t1.c1 = t2.c1) ORDER BY t2.c1, t1.c1 OFFSET 10 LIMIT 10;
+-- full outer join
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft4 t1 FULL JOIN ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 45 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft4 t1 FULL JOIN ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 45 LIMIT 10;
+-- full outer join + WHERE clause, only matched rows
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft4 t1 FULL JOIN ft5 t2 ON (t1.c1 = t2.c1) WHERE (t1.c1 = t2.c1 OR t1.c1 IS NULL) ORDER BY t1.c1, t2.c1 OFFSET 10 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft4 t1 FULL JOIN ft5 t2 ON (t1.c1 = t2.c1) WHERE (t1.c1 = t2.c1 OR t1.c1 IS NULL) ORDER BY t1.c1, t2.c1 OFFSET 10 LIMIT 10;
+-- join at WHERE clause 
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON true WHERE (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON true WHERE (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+-- join in CTE
+EXPLAIN (COSTS false, VERBOSE)
+WITH t (c1_1, c1_3, c2_1) AS (SELECT t1.c1, t1.c3, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)) SELECT c1_1, c2_1 FROM t ORDER BY c1_3, c1_1 OFFSET 100 LIMIT 10;
+WITH t (c1_1, c1_3, c2_1) AS (SELECT t1.c1, t1.c3, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)) SELECT c1_1, c2_1 FROM t ORDER BY c1_3, c1_1 OFFSET 100 LIMIT 10;
+-- ctid with whole-row reference
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.ctid, t1, t2, t1.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+SELECT t1.ctid, t1, t2, t1.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+-- partially unsafe to push down, not pushed down
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1 FROM ft1 t1 JOIN ft2 t2 ON t2.c1 = t2.c1 JOIN ft4 t3 ON t2.c1 = t3.c1 ORDER BY t1.c1 OFFSET 10 LIMIT 10;
+SELECT t1.c1 FROM ft1 t1 JOIN ft2 t2 ON t2.c1 = t2.c1 JOIN ft4 t3 ON t2.c1 = t3.c1 ORDER BY t1.c1 OFFSET 10 LIMIT 10;
+-- SEMI JOIN, not pushed down
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
+SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
+-- ANTI JOIN, not pushed down
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
+SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, not pushed down
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+-- different server
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+-- different effective user for permission check
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN v_ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN v_ft5 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+-- unsafe join conditions
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c8 = t2.c8) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c8 = t2.c8) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+-- local filter (unsafe conditions on one side)
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) WHERE t1.c8 = 'foo' ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) WHERE t1.c8 = 'foo' ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
+-- Aggregate after UNION, for testing setrefs
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) UNION SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)) AS t (t1c1, t2c1) GROUP BY t1c1 ORDER BY t1c1 OFFSET 100 LIMIT 10;
+SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) UNION SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)) AS t (t1c1, t2c1) GROUP BY t1c1 ORDER BY t1c1 OFFSET 100 LIMIT 10;
+-- join two foreign tables and two local tables
+EXPLAIN (COSTS false, VERBOSE)
+SELECT t1.c1, t2.c1 FROM ft1 t1 LEFT JOIN ft2 t2 ON t1.c1 = t2.c1 JOIN "S 1"."T 1" t3 ON t1.c1 = t3."C 1" JOIN "S 1"."T 2" t4 ON t1.c1 = t4.c1 ORDER BY t1.c1 OFFSET 10 LIMIT 10;
+SELECT t1.c1, t2.c1 FROM ft1 t1 LEFT JOIN ft2 t2 ON t1.c1 = t2.c1 JOIN "S 1"."T 1" t3 ON t1.c1 = t3."C 1" JOIN "S 1"."T 2" t4 ON t1.c1 = t4.c1 ORDER BY t1.c1 OFFSET 10 LIMIT 10;
+
+-- ===================================================================
 -- parameterized queries
 -- ===================================================================
 -- simple join
@@ -834,3 +966,7 @@ DROP TYPE "Colors" CASCADE;
 IMPORT FOREIGN SCHEMA import_source LIMIT TO (t5)
   FROM SERVER loopback INTO import_dest5;  -- ERROR
 ROLLBACK;
+
+-- Cleanup
+DROP OWNED BY view_owner;
+DROP USER view_owner;

#22

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#21)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita
Sent: Wednesday, August 12, 2015 8:26 PM
To: Robert Haas; Kaigai Kouhei(海外浩平)
Cc: PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/08/12 7:21, Robert Haas wrote:

On Fri, Aug 7, 2015 at 3:37 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

I had a discussion with Fujita-san about this topic.

Let me also share that discussion, toward an overall solution.

The root cause of this problem is that a Scan node with scanrelid==0
represents a relation join that can involve multiple relations, so the
TupleDesc of its records does not match any base relation; however,
ExecScanFetch() was not updated when scanrelid==0 became supported.

The FDW/CSP behind a Scan node with scanrelid==0 is responsible for
generating records according to the fdw_/custom_scan_tlist that
reflects the definition of the relation join, and only the FDW/CSP knows
how to combine these base relations.
In addition, host-side expressions (like Plan->qual) are initialized
to reference the records generated by the FDW/CSP, so I think the least
invasive approach is to let the FDW/CSP have its own recheck logic.

Below is the structure of ExecScanFetch().

ExecScanFetch(ScanState *node,
              ExecScanAccessMtd accessMtd,
              ExecScanRecheckMtd recheckMtd)
{
    EState *estate = node->ps.state;

    if (estate->es_epqTuple != NULL)
    {
        /*
         * We are inside an EvalPlanQual recheck.  Return the test tuple if
         * one is available, after rechecking any access-method-specific
         * conditions.
         */
        Index scanrelid = ((Scan *) node->ps.plan)->scanrelid;

        Assert(scanrelid > 0);
        if (estate->es_epqTupleSet[scanrelid - 1])
        {
            TupleTableSlot *slot = node->ss_ScanTupleSlot;
            :
            return slot;
        }
    }
    return (*accessMtd) (node);
}

When we are inside EPQ, it fetches a tuple from the es_epqTuple[] array and
checks its visibility (ForeignRecheck() always says 'yep, it is visible'),
then ExecScan() applies its qualifiers with ExecQual().
So, as long as the FDW/CSP can return a record that matches the TupleDesc
of this relation, built from the tuples in the es_epqTuple[] array, the rest
of the code paths are common.

I have an idea to solve the problem.
It adds a recheckMtd() call if scanrelid==0, just before the assertion above,
and adds an FDW callback in ForeignRecheck().
The role of this new callback is to set up the supplied TupleTableSlot
and check its visibility, but it does not define how to do this.
That is left to the FDW driver, for example by invoking an alternative plan
that consists only of built-in logic.
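
The control flow of this idea can be sketched in self-contained C. This is
only an illustration under simplifying assumptions: MockScanState, MockSlot,
mock_scan_fetch() and the two callbacks are invented stand-ins, not the real
executor definitions, and the real ExecScanFetch() does more bookkeeping than
is modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for executor structures; not the real definitions. */
typedef struct MockSlot
{
	bool		empty;			/* true after the slot has been "cleared" */
	int			value;			/* the single column of our toy tuples */
} MockSlot;

typedef struct MockScanState MockScanState;
typedef MockSlot *(*MockAccessMtd) (MockScanState *node);
typedef bool (*MockRecheckMtd) (MockScanState *node, MockSlot *slot);

struct MockScanState
{
	unsigned	scanrelid;		/* 0 means the node represents a join */
	bool		in_epq;			/* stands in for es_epqTuple != NULL */
	const bool *epq_tuple_set;	/* stands in for es_epqTupleSet[] */
	const int  *epq_tuple;		/* stands in for es_epqTuple[] */
	MockSlot	slot;			/* stands in for ss_ScanTupleSlot */
};

static MockSlot *
mock_scan_fetch(MockScanState *node, MockAccessMtd accessMtd,
				MockRecheckMtd recheckMtd)
{
	assert(accessMtd != NULL && recheckMtd != NULL);

	if (node->in_epq)
	{
		MockSlot   *slot = &node->slot;

		if (node->scanrelid == 0)
		{
			/* Proposed path: the callback fills the slot and rechecks it. */
			if (!recheckMtd(node, slot))
				slot->empty = true;
			return slot;
		}
		if (node->epq_tuple_set[node->scanrelid - 1])
		{
			/* Existing path: hand back the test tuple of one base rel. */
			slot->value = node->epq_tuple[node->scanrelid - 1];
			slot->empty = false;
			if (!recheckMtd(node, slot))
				slot->empty = true;
			return slot;
		}
	}
	return accessMtd(node);
}

/* Hypothetical FDW recheck: join the test tuples of rels 1 and 2 on equality. */
static bool
mock_join_recheck(MockScanState *node, MockSlot *slot)
{
	if (node->epq_tuple[0] != node->epq_tuple[1])
		return false;
	slot->value = node->epq_tuple[0];
	slot->empty = false;
	return true;
}

/* Normal (non-EPQ) scan; never reached in this demo. */
static MockSlot *
mock_access(MockScanState *node)
{
	node->slot.empty = true;
	return &node->slot;
}

/* Returns true if the joined row survives the EPQ recheck. */
bool
epq_join_survives(int outer, int inner)
{
	const int	tuples[2] = {outer, inner};
	const bool	set[2] = {true, true};
	MockScanState node = {0, true, set, tuples, {true, 0}};

	return !mock_scan_fetch(&node, mock_access, mock_join_recheck)->empty;
}
```

In this toy model, epq_join_survives(1, 1) reports a surviving row while
epq_join_survives(1, 2) does not, because the callback itself decides how
the base test tuples are combined.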

Invoking an alternative plan is one of the most feasible ways to
implement EPQ logic in an FDW, so I think FDW also needs a mechanism
that takes child path nodes, like custom_paths in the CustomPath node.
Once a valid path node is linked to this list, createplan.c transforms
it into the relevant plan node, and the FDW can then initialize and invoke
this plan node during execution, for example from ForeignRecheck().

This design can solve another problem Fujita-san has also mentioned.
If a scan qualifier is pushed down to the remote query and its expression
node is saved in the private area of ForeignScan, the callback in
ForeignRecheck() can evaluate the qualifier by itself. (Note that only
the FDW driver knows where and how the pushed-down expression node
is saved in the private area.)

In summary, the following three enhancements are a straightforward
way to fix the problem he reported.
1. Add a special path to call recheckMtd in ExecScanFetch if scanrelid==0
2. Add an FDW callback in ForeignRecheck() - to construct a record
according to the fdw_scan_tlist definition and to evaluate its
visibility, or to evaluate the pushed-down qualifier for a base relation.
3. Add List *fdw_paths in ForeignPath, like custom_paths of CustomPath,
to construct plan nodes for EPQ evaluation.

I'm not an expert in this area, but this plan does not seem unreasonable to
me.

As I recall from the discussion with KaiGai-san, I think that would work. I
think it would be more suitable for CSPs, though. Correct me if
I'm wrong, KaiGai-san. In either case, I'm not sure it is a good idea to
hand both of the following checks to a single callback routine hooked into
ForeignRecheck: (a) checking whether the test tuple for each component
foreign table satisfies the remote qual condition, and (b) checking whether
those tuples satisfy the remote join condition. I think
that would be too complicated, probably making the callback routine
bug-prone. So I'd still propose that *the core* process (a) and (b)
*separately*.

* As for (a), the core checks the remote qual condition as in [1].

* As for (b), the core executes an alternative subplan locally if inside
an EPQ recheck. The subplan is created as described in [2].

I don't think it is "too" complicated, because (a) the visibility check of
the base tuples (saved in es_epqTuple[]) is done in the underlying
base foreign-scan nodes, executed as part of the alternative plan, and
(b) the remote qual is evaluated with an ExecQual() call.

It seems to me that your proposal assumes a particular design
for FDW drivers; however, we already have various kinds of FDW
drivers, not only wrappers of remote RDBMSs.
https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Are [1] and [2] actually suitable for "all" of them?

Let's assume an FDW module that implements its own columnar storage and
has a special JOIN capability when both sides are its columnar storage.
Does it need an alternative sub-plan for EPQ rechecks? Probably not,
because it is capable of running the JOIN by itself.
It is inconvenient for this FDW if the core automatically kicks off a
sub-plan in spite of its own functionality/capability.

If potential bugs are a concern, the common part should be factored out
and provided as a utility function. The FDW can decide whether to
use it, but it should never be enforced.
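
As a rough illustration of "provided but not enforced", the callback can be
an optional function pointer that each FDW fills in as it likes. The sketch
below is self-contained C with invented names (MockFdwRoutine, MockNode,
mock_core_local_join_recheck, mock_native_recheck, run_recheck); none of
them exist in PostgreSQL, and the equality test merely stands in for a real
join recheck.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct MockNode MockNode;
typedef bool (*MockRecheckFn) (MockNode *node);

/* Stand-in for FdwRoutine: the recheck callback is optional. */
typedef struct MockFdwRoutine
{
	MockRecheckFn recheck;		/* NULL if the FDW provides none */
} MockFdwRoutine;

struct MockNode
{
	const MockFdwRoutine *routine;
	int			outer_key;		/* test tuple of the outer relation */
	int			inner_key;		/* test tuple of the inner relation */
};

/* Hypothetical core utility: recheck the join locally. */
bool
mock_core_local_join_recheck(MockNode *node)
{
	return node->outer_key == node->inner_key;
}

/* Hypothetical FDW-native recheck, e.g. for a columnar-storage FDW. */
bool
mock_native_recheck(MockNode *node)
{
	return node->outer_key == node->inner_key;
}

/* Core side: call the callback only when the FDW actually provides one. */
bool
mock_foreign_recheck(MockNode *node)
{
	if (node->routine->recheck != NULL)
		return node->routine->recheck(node);

	/* There are no access-method-specific conditions to recheck. */
	return true;
}

/* Small driver: build a node with the chosen callback and recheck it. */
bool
run_recheck(MockRecheckFn fn, int outer_key, int inner_key)
{
	MockFdwRoutine routine = {fn};
	MockNode	node = {&routine, outer_key, inner_key};

	return mock_foreign_recheck(&node);
}
```

An FDW that wants the common behaviour would point the callback at the
shared helper; one with its own join capability would install its own
routine; one that leaves it NULL keeps the current "always visible" result.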

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#23

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#1)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

Fujita-san,

The attached patch enhances the FDW interface along the lines described
below (not tested yet).

In summary, the following three enhancements are a straightforward
way to fix the problem he reported.
1. Add a special path to call recheckMtd in ExecScanFetch if scanrelid==0
2. Add an FDW callback in ForeignRecheck() - to construct a record
according to the fdw_scan_tlist definition and to evaluate its
visibility, or to evaluate the pushed-down qualifier for a base relation.
3. Add List *fdw_paths in ForeignPath, like custom_paths of CustomPath,
to construct plan nodes for EPQ evaluation.

Likely, what you need to do is:
1. Save the alternative path in fdw_paths when pushing down a foreign join.
GetForeignJoinPaths() may be called multiple times for a particular
joinrel, once per combination of innerrel/outerrel.
RelOptInfo->fdw_private lets you avoid constructing the same remote
join path multiple times. On the second or later invocation, it may be
a good tactic to look at cheapest_startup_path and replace the saved
path if the later invocation has a cheaper one, before exiting.
2. Save the alternative Plan nodes in fdw_plans, lefttree/righttree, or
wherever you like in GetForeignPlan().
3. Make BeginForeignScan() call ExecInitNode() on the plan node
saved at (2), then save the PlanState in fdw_ps, lefttree/righttree,
or some private area if it should not be displayed by EXPLAIN.
4. Implement the ForeignRecheck() routine. If scanrelid==0, it kicks the
PlanState node saved at (3) to generate a tuple slot. Then it calls
ExecQual() to check the qualifiers that were pushed down.
5. Make EndForeignScan() call ExecEndNode() on the PlanState
saved at (3).

I don't think the above steps are "too" complicated for people who can write
FDW drivers. It is what developers usually do.
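
Steps 3 to 5 above amount to an init / recheck / teardown lifecycle around
the saved alternative plan. A minimal self-contained C sketch of that
lifecycle follows. All types and functions in it (MockPlan, MockPlanState,
MockForeignScanState, mock_begin, mock_recheck, mock_end) are hypothetical
stand-ins rather than PostgreSQL APIs, the "alternative plan" is reduced to
an equality join of two test values, and the pushed-down qual is assumed to
be value >= min_value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy "alternative plan": a local equi-join over two EPQ test tuples. */
typedef struct MockPlan
{
	int			outer_val;
	int			inner_val;
} MockPlan;

typedef struct MockPlanState
{
	const MockPlan *plan;
} MockPlanState;

typedef struct MockForeignScanState
{
	const MockPlan *alt_plan;	/* step 2: saved at plan time */
	MockPlanState *alt_state;	/* step 3: set up at begin time */
	int			min_value;		/* pushed-down qual: value >= min_value */
	int			slot_value;		/* stands in for the scan tuple slot */
} MockForeignScanState;

/* Step 3: analogous to calling ExecInitNode() from BeginForeignScan(). */
static bool
mock_begin(MockForeignScanState *fss)
{
	fss->alt_state = malloc(sizeof(MockPlanState));
	if (fss->alt_state == NULL)
		return false;
	fss->alt_state->plan = fss->alt_plan;
	return true;
}

/* Step 4: run the alternative plan, then apply the pushed-down qual. */
static bool
mock_recheck(MockForeignScanState *fss)
{
	const MockPlan *plan = fss->alt_state->plan;

	if (plan->outer_val != plan->inner_val)
		return false;			/* the local join produced no row */
	fss->slot_value = plan->outer_val;
	return fss->slot_value >= fss->min_value;
}

/* Step 5: analogous to calling ExecEndNode() from EndForeignScan(). */
static void
mock_end(MockForeignScanState *fss)
{
	free(fss->alt_state);
	fss->alt_state = NULL;
}

/* Drive one begin / recheck / end cycle and report the recheck result. */
bool
epq_recheck_lifecycle(int outer_val, int inner_val, int min_value)
{
	MockPlan	plan = {outer_val, inner_val};
	MockForeignScanState fss = {&plan, NULL, min_value, 0};
	bool		ok;

	if (!mock_begin(&fss))
		return false;
	ok = mock_recheck(&fss);
	mock_end(&fss);
	assert(fss.alt_state == NULL);
	return ok;
}
```

The recheck fails either because the local join yields no row or because
the joined row does not pass the pushed-down qual, which mirrors the two
tasks described for the callback.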

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

-----Original Message-----
From: Kaigai Kouhei(海外浩平)
Sent: Wednesday, August 12, 2015 11:17 PM
To: 'Etsuro Fujita'; Robert Haas
Cc: PostgreSQL-development; 花田茂
Subject: RE: [HACKERS] Foreign join pushdown vs EvalPlanQual

Attachments:

pgsql-fdw-epq-recheck.v1.patch (application/octet-stream)

 src/backend/commands/explain.c          | 23 +++++++++++++++++++++++
 src/backend/executor/execScan.c         | 13 +++++++++++--
 src/backend/executor/nodeForeignscan.c  | 13 +++++++++++++
 src/backend/nodes/copyfuncs.c           |  1 +
 src/backend/nodes/outfuncs.c            |  2 ++
 src/backend/optimizer/plan/createplan.c | 13 ++++++++++++-
 src/backend/optimizer/plan/setrefs.c    |  8 ++++++++
 src/backend/optimizer/plan/subselect.c  | 24 ++++++++++++++++++++----
 src/include/foreign/fdwapi.h            |  7 ++++++-
 src/include/nodes/execnodes.h           |  1 +
 src/include/nodes/plannodes.h           |  1 +
 src/include/nodes/relation.h            |  1 +
 12 files changed, 99 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5d06fa4..3396208 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -117,6 +117,8 @@ static void ExplainMemberNodes(List *plans, PlanState **planstates,
 				   List *ancestors, ExplainState *es);
 static void ExplainSubPlans(List *plans, List *ancestors,
 				const char *relationship, ExplainState *es);
+static void ExplainForeignChildren(ForeignScanState *fss,
+					  List *ancestors, ExplainState *es);
 static void ExplainCustomChildren(CustomScanState *css,
 					  List *ancestors, ExplainState *es);
 static void ExplainProperty(const char *qlabel, const char *value,
@@ -1615,6 +1617,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		IsA(plan, BitmapAnd) ||
 		IsA(plan, BitmapOr) ||
 		IsA(plan, SubqueryScan) ||
+		(IsA(planstate, ForeignScanState) &&
+		 ((ForeignScanState *) planstate)->fdw_ps != NIL) ||
 		(IsA(planstate, CustomScanState) &&
 		 ((CustomScanState *) planstate)->custom_ps != NIL) ||
 		planstate->subPlan;
@@ -1671,6 +1675,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			ExplainNode(((SubqueryScanState *) planstate)->subplan, ancestors,
 						"Subquery", NULL, es);
 			break;
+		case T_ForeignScan:
+			ExplainForeignChildren((ForeignScanState *) planstate,
+								   ancestors, es);
+			break;
 		case T_CustomScan:
 			ExplainCustomChildren((CustomScanState *) planstate,
 								  ancestors, es);
@@ -2711,6 +2719,21 @@ ExplainSubPlans(List *plans, List *ancestors,
 }
 
 /*
+ * Explain a list of children of a ForeignScan.
+ */
+static void
+ExplainForeignChildren(ForeignScanState *fss,
+					   List *ancestors, ExplainState *es)
+{
+	ListCell   *cell;
+	const char *label =
+		(list_length(fss->fdw_ps) != 1 ? "children" : "child");
+
+	foreach(cell, fss->fdw_ps)
+		ExplainNode((PlanState *) lfirst(cell), ancestors, label, NULL, es);
+}
+
+/*
  * Explain a list of children of a CustomScan.
  */
 static void
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index a96e826..1a9bcba 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -49,8 +49,17 @@ ExecScanFetch(ScanState *node,
 		 */
 		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;
 
-		Assert(scanrelid > 0);
-		if (estate->es_epqTupleSet[scanrelid - 1])
+		if (scanrelid == 0)
+		{
+			TupleTableSlot *slot = node->ss_ScanTupleSlot;
+
+			/* Check if it meets the access-method conditions */
+			if (!(*recheckMtd) (node, slot))
+				ExecClearTuple(slot);   /* would not be returned by scan */
+
+			return slot;
+		}
+		else if (estate->es_epqTupleSet[scanrelid - 1])
 		{
 			TupleTableSlot *slot = node->ss_ScanTupleSlot;
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index bb28a73..d1b36ab 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -72,6 +72,19 @@ ForeignNext(ForeignScanState *node)
 static bool
 ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
 {
+	/*
+	 * This FDW callback has two tasks. (1) If this ForeignScanState
+	 * represents an external join (thus scanrelid==0), it needs to
+	 * construct a tuple matching the TupleDesc of the slot, which is
+	 * initialized according to the fdw_scan_tlist. (2) If this node
+	 * has any qualifiers that are not executed locally, it has to
+	 * recheck those qualifiers itself (because ExecQual in ExecScan
+	 * evaluates only node->scan.plan.qual, not the qualifiers that
+	 * were pushed down).
+	 */
+	if (node->fdwroutine->RecheckForeignScan)
+		return node->fdwroutine->RecheckForeignScan(node, slot);
+
 	/* There are no access-method-specific conditions to recheck. */
 	return true;
 }
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 1c8425d..23f9942 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -621,6 +621,7 @@ _copyForeignScan(const ForeignScan *from)
 	 * copy remainder of node
 	 */
 	COPY_SCALAR_FIELD(fs_server);
+	COPY_NODE_FIELD(fdw_plans);
 	COPY_NODE_FIELD(fdw_exprs);
 	COPY_NODE_FIELD(fdw_private);
 	COPY_NODE_FIELD(fdw_scan_tlist);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index a878498..fbe5d05 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -576,6 +576,7 @@ _outForeignScan(StringInfo str, const ForeignScan *node)
 	_outScanInfo(str, (const Scan *) node);
 
 	WRITE_OID_FIELD(fs_server);
+	WRITE_NODE_FIELD(fdw_plans);
 	WRITE_NODE_FIELD(fdw_exprs);
 	WRITE_NODE_FIELD(fdw_private);
 	WRITE_NODE_FIELD(fdw_scan_tlist);
@@ -1664,6 +1665,7 @@ _outForeignPath(StringInfo str, const ForeignPath *node)
 
 	_outPathInfo(str, (const Path *) node);
 
+	WRITE_NODE_FIELD(fdw_paths);
 	WRITE_NODE_FIELD(fdw_private);
 }
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 404c6f5..a915cb6 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2059,11 +2059,20 @@ create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
 	Index		scan_relid = rel->relid;
 	Oid			rel_oid = InvalidOid;
 	Bitmapset  *attrs_used = NULL;
+	List	   *fdw_plans = NIL;
 	ListCell   *lc;
 	int			i;
 
 	Assert(rel->fdwroutine != NULL);
 
+	/* Recursively transform child paths. */
+	foreach (lc, best_path->fdw_paths)
+	{
+		Plan   *plan = create_plan_recurse(root, (Path *) lfirst(lc));
+
+		fdw_plans = lappend(fdw_plans, plan);
+	}
+
 	/*
 	 * If we're scanning a base relation, fetch its OID.  (Irrelevant if
 	 * scanning a join relation.)
@@ -2093,7 +2102,9 @@ create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
 	 */
 	scan_plan = rel->fdwroutine->GetForeignPlan(root, rel, rel_oid,
 												best_path,
-												tlist, scan_clauses);
+												tlist,
+												scan_clauses,
+												fdw_plans);
 
 	/* Copy cost data from Path to Plan; no need to make FDW do this */
 	copy_path_costsize(&scan_plan->scan.plan, &best_path->path);
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index ee8710d..dba91ee 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1093,6 +1093,8 @@ set_foreignscan_references(PlannerInfo *root,
 						   ForeignScan *fscan,
 						   int rtoffset)
 {
+	ListCell   *lc;
+
 	/* Adjust scanrelid if it's valid */
 	if (fscan->scan.scanrelid > 0)
 		fscan->scan.scanrelid += rtoffset;
@@ -1136,6 +1138,12 @@ set_foreignscan_references(PlannerInfo *root,
 			fix_scan_list(root, fscan->fdw_exprs, rtoffset);
 	}
 
+	/* Adjust child plan-nodes recursively, if needed */
+	foreach(lc, fscan->fdw_plans)
+	{
+		lfirst(lc) = set_plan_refs(root, (Plan *) lfirst(lc), rtoffset);
+	}
+
 	/* Adjust fs_relids if needed */
 	if (rtoffset > 0)
 	{
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index d0bc412..26bfeb4 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2394,10 +2394,26 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_ForeignScan:
-			finalize_primnode((Node *) ((ForeignScan *) plan)->fdw_exprs,
-							  &context);
-			/* We assume fdw_scan_tlist cannot contain Params */
-			context.paramids = bms_add_members(context.paramids, scan_params);
+			{
+				ForeignScan	*fscan = (ForeignScan *) plan;
+				ListCell	*lc;
+
+				finalize_primnode((Node *) fscan->fdw_exprs, &context);
+				/* We assume fdw_scan_tlist cannot contain Params */
+				context.paramids =
+					bms_add_members(context.paramids, scan_params);
+
+				/* child nodes if any */
+				foreach (lc, fscan->fdw_plans)
+				{
+					context.paramids =
+						bms_add_members(context.paramids,
+										finalize_plan(root,
+													  (Plan *) lfirst(lc),
+													  valid_params,
+													  scan_params));
+				}
+			}
 			break;
 
 		case T_CustomScan:
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 69b48b4..4a41351 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -36,13 +36,17 @@ typedef ForeignScan *(*GetForeignPlan_function) (PlannerInfo *root,
 														  Oid foreigntableid,
 													  ForeignPath *best_path,
 															 List *tlist,
-														 List *scan_clauses);
+												 List *scan_clauses,
+												 List *fdw_plans);
 
 typedef void (*BeginForeignScan_function) (ForeignScanState *node,
 													   int eflags);
 
 typedef TupleTableSlot *(*IterateForeignScan_function) (ForeignScanState *node);
 
+typedef bool (*RecheckForeignScan_function) (ForeignScanState *node,
+											 TupleTableSlot *slot);
+
 typedef void (*ReScanForeignScan_function) (ForeignScanState *node);
 
 typedef void (*EndForeignScan_function) (ForeignScanState *node);
@@ -138,6 +142,7 @@ typedef struct FdwRoutine
 	GetForeignPlan_function GetForeignPlan;
 	BeginForeignScan_function BeginForeignScan;
 	IterateForeignScan_function IterateForeignScan;
+	RecheckForeignScan_function RecheckForeignScan;
 	ReScanForeignScan_function ReScanForeignScan;
 	EndForeignScan_function EndForeignScan;
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5796de8..1453be2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1579,6 +1579,7 @@ typedef struct ForeignScanState
 	ScanState	ss;				/* its first field is NodeTag */
 	/* use struct pointer to avoid including fdwapi.h here */
 	struct FdwRoutine *fdwroutine;
+	List	   *fdw_ps;			/* list of child PlanState nodes, if any */
 	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 0654d02..2a5045f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -518,6 +518,7 @@ typedef struct ForeignScan
 {
 	Scan		scan;
 	Oid			fs_server;		/* OID of foreign server */
+	List	   *fdw_plans;		/* list of Plan nodes, if any */
 	List	   *fdw_exprs;		/* expressions that FDW may evaluate */
 	List	   *fdw_private;	/* private data for FDW */
 	List	   *fdw_scan_tlist; /* optional tlist describing scan tuple */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 5dc23d9..78038d2 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -901,6 +901,7 @@ typedef struct TidPath
 typedef struct ForeignPath
 {
 	Path		path;
+	List	   *fdw_paths;
 	List	   *fdw_private;
 } ForeignPath;

#24

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Kouhei Kaigai (#23)

Re: Foreign join pushdown vs EvalPlanQual

Fujita-san,

What is your opinion of the solution?
The September commitfest (CF:Sep) starts next week, so I'd like to reach
a consensus at least on the direction.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
Sent: Thursday, August 13, 2015 10:13 AM
To: Etsuro Fujita; Robert Haas
Cc: PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

Fujita-san,

The attached patch enhances the FDW interface in the direction described
below (it has not been tested yet).

In summary, the following three enhancements are a straightforward
way to fix the problem he reported.
1. Add a special path in ExecScanFetch that calls recheckMtd if scanrelid==0.
2. Add an FDW callback in ForeignRecheck() - to construct a record
according to the fdw_scan_tlist definition and evaluate its
visibility, or, for a base relation, to evaluate the pushed-down qualifiers.
3. Add List *fdw_paths to ForeignPath, like custom_paths in CustomPath,
to construct plan nodes for EPQ evaluation.

What you would probably need to do is:
1. Save the alternative path in fdw_paths when pushing down a foreign join.
GetForeignJoinPaths() may be called multiple times for a particular
joinrel, once per combination of innerrel/outerrel.
RelOptInfo->fdw_private lets you avoid constructing the same remote
join path multiple times. On the second or later invocation, a good
tactic may be to look at cheapest_startup_path and, before returning,
replace the saved path if the later invocation has a cheaper one.
2. Save the alternative Plan nodes in fdw_plans, lefttree/righttree, or
wherever you like, in GetForeignPlan().
3. Make BeginForeignScan() call ExecInitNode() on the plan node
saved in (2), then save the PlanState in fdw_ps, lefttree/righttree,
or some private area if it should not be displayed by EXPLAIN.
4. Implement the ForeignRecheck() routine. If scanrelid==0, it runs the
PlanState node saved in (3) to generate a tuple slot, then calls
ExecQual() to check the pushed-down qualifiers.
5. Make EndForeignScan() call ExecEndNode() on the PlanState
saved in (3).

I don't think the above steps are "too" complicated for people who can write
FDW drivers. It is what developers usually do.
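
As a rough illustration of steps 3 to 5, here is a stand-alone model of the
begin/recheck/end life cycle. It is only a sketch under stated assumptions:
every Mock* type and mock_*/demo_* function is hypothetical and merely stands
in for ForeignScanState, the saved PlanState, and the ExecInitNode(),
ExecProcNode(), ExecQual() and ExecEndNode() calls. The pushed-down qualifier
is modelled as a simple "a >= min_a" test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tuple slot (not the PostgreSQL type). */
typedef struct MockSlot
{
    bool        empty;
    int         a;              /* join key carried by the joined row */
} MockSlot;

/* Hypothetical stand-in for the PlanState of a local join over two test tuples. */
typedef struct MockJoinState
{
    int         outer_a;
    int         inner_a;
} MockJoinState;

/* Hypothetical stand-in for ForeignScanState. */
typedef struct MockForeignScanState
{
    unsigned    scanrelid;      /* 0 for a pushed-down join */
    MockJoinState *alt_ps;      /* plays the role of the PlanState saved in step 3 */
    int         min_a;          /* pushed-down qual, modelled as "a >= min_a" */
} MockForeignScanState;

/* Step 3: initialize the alternative plan (real code would call ExecInitNode). */
static void
mock_begin(MockForeignScanState *node, int outer_a, int inner_a)
{
    node->alt_ps = malloc(sizeof(MockJoinState));
    assert(node->alt_ps != NULL);
    node->alt_ps->outer_a = outer_a;
    node->alt_ps->inner_a = inner_a;
}

/* Runs the local join once; returns false when the test tuples do not join. */
static bool
mock_exec_alt_plan(const MockJoinState *ps, MockSlot *slot)
{
    if (ps->outer_a != ps->inner_a)
        return false;
    slot->empty = false;
    slot->a = ps->outer_a;
    return true;
}

/* Step 4: build the joined row if needed, then recheck the pushed-down qual. */
static bool
mock_recheck(MockForeignScanState *node, MockSlot *slot)
{
    if (node->scanrelid == 0 && !mock_exec_alt_plan(node->alt_ps, slot))
        return false;
    return !slot->empty && slot->a >= node->min_a;
}

/* Step 5: release the alternative plan (real code would call ExecEndNode). */
static void
mock_end(MockForeignScanState *node)
{
    free(node->alt_ps);
    node->alt_ps = NULL;
}

/* Runs begin/recheck/end once for a join node and reports the recheck result. */
static bool
demo_recheck(int outer_a, int inner_a, int min_a)
{
    MockForeignScanState node = {0, NULL, min_a};
    MockSlot    slot = {true, 0};
    bool        ok;

    mock_begin(&node, outer_a, inner_a);
    ok = mock_recheck(&node, &slot);
    mock_end(&node);
    return ok;
}
```

In this model, demo_recheck(1, 1, 0) passes, while a join-key mismatch or a
failing pushed-down qualifier makes the recheck return false.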

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

-----Original Message-----
From: Kaigai Kouhei(海外浩平)
Sent: Wednesday, August 12, 2015 11:17 PM
To: 'Etsuro Fujita'; Robert Haas
Cc: PostgreSQL-development; 花田茂
Subject: RE: [HACKERS] Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita
Sent: Wednesday, August 12, 2015 8:26 PM
To: Robert Haas; Kaigai Kouhei(海外浩平)
Cc: PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/08/12 7:21, Robert Haas wrote:

On Fri, Aug 7, 2015 at 3:37 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

I had a discussion with Fujita-san about this topic.

Let me also share that discussion, which covers the entire solution.

The root cause of this problem is that a Scan node with scanrelid==0
represents a join that can involve multiple relations, so the
TupleDesc of its records does not match any base relation; however,
ExecScanFetch() was not updated when scanrelid==0 became supported.

The FDW/CSP behind a Scan node with scanrelid==0 is responsible
for generating records according to the fdw_/custom_scan_tlist that
reflects the definition of the join, and only the FDW/CSP knows how
to combine these base relations.
In addition, host-side expressions (like Plan->qual) are initialized
to reference the records generated by the FDW/CSP, so I think the least
invasive approach is to let the FDW/CSP have its own recheck logic.

Below is the structure of ExecScanFetch().

ExecScanFetch(ScanState *node,
              ExecScanAccessMtd accessMtd,
              ExecScanRecheckMtd recheckMtd)
{
    EState *estate = node->ps.state;

    if (estate->es_epqTuple != NULL)
    {
        /*
         * We are inside an EvalPlanQual recheck. Return the test tuple if
         * one is available, after rechecking any access-method-specific
         * conditions.
         */
        Index scanrelid = ((Scan *) node->ps.plan)->scanrelid;

        Assert(scanrelid > 0);
        if (estate->es_epqTupleSet[scanrelid - 1])
        {
            TupleTableSlot *slot = node->ss_ScanTupleSlot;
            :
            return slot;
        }
    }
    return (*accessMtd) (node);
}

When we are inside EPQ, it fetches a tuple from the es_epqTuple[] array and
checks its visibility (ForeignRecheck() always says 'yep, it is visible'),
then ExecScan() applies its qualifiers with ExecQual().
So, as long as the FDW/CSP can return a record that satisfies the TupleDesc
of this relation, built from the tuples in the es_epqTuple[] array, the rest
of the code path is common.

I have an idea to solve the problem.
It adds a recheckMtd() call if scanrelid==0, just before the assertion above,
and adds an FDW callback in ForeignRecheck().
The role of this new callback is to set up the supplied TupleTableSlot
and check its visibility, but it does not define how to do this.
That is left to the FDW driver; for example, it may invoke an alternative
plan that consists only of built-in logic.
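
The control flow of this idea can be illustrated with a small stand-alone
model. This is only a sketch: MockSlot, MockScanState, mock_scan_fetch and the
demo_* callbacks are hypothetical stand-ins, not the real executor types or
ExecScanFetch() itself. It shows only that, inside an EPQ recheck, a node with
scanrelid==0 leaves both building the tuple and checking it to its recheck
callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for TupleTableSlot and ScanState (not PostgreSQL types). */
typedef struct MockSlot
{
    bool        empty;          /* true when the slot holds no tuple */
    int         value;          /* payload of the (joined) tuple */
} MockSlot;

typedef struct MockScanState
{
    unsigned    scanrelid;      /* 0 means the node represents a join */
    bool        in_epq;         /* models estate->es_epqTuple != NULL */
    const bool *epq_tuple_set;  /* models estate->es_epqTupleSet[] */
    MockSlot    slot;           /* models node->ss_ScanTupleSlot */
} MockScanState;

typedef MockSlot *(*MockAccessMtd) (MockScanState *node);
typedef bool (*MockRecheckMtd) (MockScanState *node, MockSlot *slot);

static MockSlot *
mock_scan_fetch(MockScanState *node, MockAccessMtd access, MockRecheckMtd recheck)
{
    if (node->in_epq)
    {
        MockSlot   *slot = &node->slot;

        if (node->scanrelid == 0)
        {
            /* Join node: the callback must build the tuple and check it. */
            if (!recheck(node, slot))
                slot->empty = true;
            return slot;
        }
        else if (node->epq_tuple_set[node->scanrelid - 1])
        {
            /* Base relation: the test tuple is assumed to be in the slot. */
            if (!recheck(node, slot))
                slot->empty = true;
            return slot;
        }
    }
    return access(node);
}

/* Ordinary scan outside EPQ: returns a row with value 1. */
static MockSlot *
demo_access(MockScanState *node)
{
    node->slot.empty = false;
    node->slot.value = 1;
    return &node->slot;
}

/* Recheck that pretends to rebuild a joined row from the EPQ test tuples. */
static bool
demo_join_recheck(MockScanState *node, MockSlot *slot)
{
    (void) node;
    slot->empty = false;
    slot->value = 42;
    return true;
}

/* Recheck that rejects the row. */
static bool
demo_reject_recheck(MockScanState *node, MockSlot *slot)
{
    (void) node;
    (void) slot;
    return false;
}

/* Value fetched by a join node, or -1 if the returned slot is empty. */
static int
demo_fetch(bool in_epq, MockRecheckMtd recheck)
{
    MockScanState node = {0};
    MockSlot   *slot;

    node.in_epq = in_epq;
    slot = mock_scan_fetch(&node, demo_access, recheck);
    assert(slot != NULL);
    return slot->empty ? -1 : slot->value;
}
```

Outside EPQ the model falls through to the access method; inside EPQ the
result depends entirely on what the callback puts in the slot.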

Invoking an alternative plan is one of the most feasible ways to
implement EPQ logic in an FDW, so I think FDWs also need a mechanism
that takes child path nodes, like custom_paths in the CustomPath node.
Once a valid path node is linked to this list, createplan.c transforms
it into the corresponding plan node, and the FDW can then initialize and
invoke this plan node during execution, for example from ForeignRecheck().

This design can also solve another problem Fujita-san mentioned.
If a scan qualifier is pushed down to the remote query and its expression
node is saved in the private area of ForeignScan, the callback in
ForeignRecheck() can evaluate the qualifier by itself. (Note that only
the FDW driver knows where and how the pushed-down expression node
is saved in the private area.)

In summary, the following three enhancements are a straightforward
way to fix the problem he reported.
1. Add a special path in ExecScanFetch that calls recheckMtd if scanrelid==0.
2. Add an FDW callback in ForeignRecheck() - to construct a record
according to the fdw_scan_tlist definition and evaluate its
visibility, or, for a base relation, to evaluate the pushed-down qualifiers.
3. Add List *fdw_paths to ForeignPath, like custom_paths in CustomPath,
to construct plan nodes for EPQ evaluation.

I'm not an expert in this area, but this plan does not seem unreasonable
to me.

If I recall the discussion with KaiGai-san correctly, I think that would
work. I think it would be more suitable for CSPs, though. Correct me if
I'm wrong, KaiGai-san. In either case, I'm not sure it is a good idea to
hand both of the following tasks to a single callback routine hooked in
ForeignRecheck: (a) checking whether the test tuple for
each component foreign table satisfies the remote qual condition and (b)
checking whether those tuples satisfy the remote join condition. I think
that would be too complicated, probably making the callback routine
bug-prone. So, I'd still propose that *the core* processes (a) and (b)
*separately*.

* As for (a), the core checks the remote qual condition as in [1].

* As for (b), the core executes an alternative subplan locally if inside
an EPQ recheck. The subplan is created as described in [2].

I don't think it is "too" complicated, because (a) the visibility check of
the base tuples (saved in es_epqTuple[]) is done in the underlying
base foreign-scan nodes, executed as part of the alternative plan, and
(b) the remote qual is evaluated with an ExecQual() call.

It seems to me that your proposal assumes a particular design
for FDW drivers; however, we already have various kinds of FDW
drivers, not only wrappers for remote RDBMSs.
https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Are [1] and [2] actually suitable for "all" of them?

Let's assume an FDW module that implements its own columnar storage and
has a special JOIN capability when both sides are its columnar storage.
Does it need an alternative sub-plan for EPQ rechecks? Probably not,
because it can run the JOIN by itself.
It is inconvenient for this FDW if the core automatically kicks off a
sub-plan in spite of the FDW's own capability.

If potential bugs are a concern, the common part can be factored out
and provided as a utility function. An FDW can decide whether to
use it, but is never forced to.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#24)

Re: Foreign join pushdown vs EvalPlanQual

Hi KaiGai-san,

On 2015/08/25 10:18, Kouhei Kaigai wrote:

How about your opinion towards the solution?

Likely, what you need to do are...
1. Save the alternative path on fdw_paths when foreign join push-down.
GetForeignJoinPaths() may be called multiple times towards a particular
joinrel according to the combination of innerrel/outerrel.
RelOptInfo->fdw_private allows to avoid construction of same remote
join path multiple times. On the second or later invocation, it may be
a good tactics to reference cheapest_startup_path and replace the saved
one if later invocation have cheaper one, prior to exit.

I'm not sure that tactic is a good one. I think you probably
assume that GetForeignJoinPaths executes set_cheapest each time it
gets called, but ISTM that that would be expensive. (That is one of the
reasons why I think it would be better to hook that routine in
standard_join_search.)

2. Save the alternative Plan nodes on fdw_plans or lefttree/righttree
somewhere you like at the GetForeignPlan()
3. Makes BeginForeignScan() to call ExecInitNode() towards the plan node
saved at (2), then save the PlanState on fdw_ps, lefttree/righttree,
or somewhere private area if not displayed on EXPLAIN.
4. Implement ForeignRecheck() routine. If scanrelid==0, it kicks the
planstate node saved at (3) to generate tuple slot. Then, call the
ExecQual() to check qualifiers being pushed down.
5. Makes EndForeignScan() to call ExecEndNode() towards the PlanState
saved at (3).

I never think above steps are "too" complicated for people who can write
FDW drivers. It is what developer usually does.

Sorry, my explanation was not accurate, but the design that you proposed
looks complicated beyond necessity. I think we should add an FDW API
for doing something if FDWs have more knowledge about doing that than
the core, but in your proposal, instead of the core, an FDW has to
eventually do a lot of the core's work: ExecInitNode, ExecProcNode,
ExecQual, ExecEndNode and so on.

Thank you for the comments!

Best regards,
Etsuro Fujita


#26

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#25)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/08/25 10:18, Kouhei Kaigai wrote:

How about your opinion towards the solution?

Likely, what you need to do are...
1. Save the alternative path on fdw_paths when foreign join push-down.
GetForeignJoinPaths() may be called multiple times towards a particular
joinrel according to the combination of innerrel/outerrel.
RelOptInfo->fdw_private allows to avoid construction of same remote
join path multiple times. On the second or later invocation, it may be
a good tactics to reference cheapest_startup_path and replace the saved
one if later invocation have cheaper one, prior to exit.

I'm not sure that the tactics is a good one. I think you probably
assume that GetForeignJoinPaths executes set_cheapest each time that
gets called, but ISTM that that would be expensive. (That is one of the
reason why I think it would be better to hook that routine in
standard_join_search.)

These are two different problems. I'd like to distinguish whether each problem
is "must be solved" or "nice to have". Obviously, the failure of the EPQ check
is a problem that must be solved; the hook location, however, is nice to have.

In addition, you may have misunderstood my proposal above.
You can check RelOptInfo->fdw_private at the top of GetForeignJoinPaths;
then, if it is the second or a later invocation, you can check the cost of the
alternative path kept in the previously constructed ForeignPath node.
If cheapest_total_path at the time of the GetForeignJoinPaths invocation
is cheaper than the saved alternative path, you can update the node to
replace the alternative path.

2. Save the alternative Plan nodes on fdw_plans or lefttree/righttree
somewhere you like at the GetForeignPlan()
3. Makes BeginForeignScan() to call ExecInitNode() towards the plan node
saved at (2), then save the PlanState on fdw_ps, lefttree/righttree,
or somewhere private area if not displayed on EXPLAIN.
4. Implement ForeignRecheck() routine. If scanrelid==0, it kicks the
planstate node saved at (3) to generate tuple slot. Then, call the
ExecQual() to check qualifiers being pushed down.
5. Makes EndForeignScan() to call ExecEndNode() towards the PlanState
saved at (3).

I never think above steps are "too" complicated for people who can write
FDW drivers. It is what developer usually does.

Sorry, my explanation was not accurate, but the design that you proposed
looks complicated beyond necessity. I think we should add an FDW API
for doing something if FDWs have more knowledge about doing that than
the core, but in your proposal, instead of the core, an FDW has to
eventually do a lot of the core's work: ExecInitNode, ExecProcNode,
ExecQual, ExecEndNode and so on.

It is a trade-off between interface flexibility and the code size
of an FDW extension, if the extension fits the scope of the core support.
I take the viewpoint that gives the highest priority to flexibility,
especially when unpredictable types of modules are expected.
Your proposal is comfortable for FDWs on top of an RDBMS; however, nobody
can promise it is beneficial for FDWs on top of, for example, a columnar store.

If you care about the code size of FDWs for RDBMSs, we can add
utility functions in foreign.c or somewhere. They would provide
equivalent functionality, but the FDW could decide whether to use them.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#27

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#26)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/08/26 13:49, Kouhei Kaigai wrote:

On 2015/08/25 10:18, Kouhei Kaigai wrote:

Likely, what you need to do are...
1. Save the alternative path on fdw_paths when foreign join push-down.
GetForeignJoinPaths() may be called multiple times towards a particular
joinrel according to the combination of innerrel/outerrel.
RelOptInfo->fdw_private allows to avoid construction of same remote
join path multiple times. On the second or later invocation, it may be
a good tactics to reference cheapest_startup_path and replace the saved
one if later invocation have cheaper one, prior to exit.

I'm not sure that the tactics is a good one. I think you probably
assume that GetForeignJoinPaths executes set_cheapest each time that
gets called, but ISTM that that would be expensive. (That is one of the
reason why I think it would be better to hook that routine in
standard_join_search.)

Here is two different problems. I'd like to identify whether the problem
is "must be solved" or "nice to have". Obviously, failure on EPQ check
is a problem must be solved, however, hook location is nice to have.

OK, I'll focus on the "must be solved" problem, at least on this thread.

In addition, you may misunderstand the proposition of mine above.
You can check RelOptInfo->fdw_private on top of the GetForeignJoinPaths,
then, if it is second or later invocation, you can check cost of the
alternative path kept in the ForeignPath node previously constructed.
If cheapest_total_path at the moment of GetForeignJoinPaths invocation
is cheaper than the saved alternative path, you can adjust the node to
replace the alternative path node.

To get the (probably unparameterized) cheapest_total_path, IIUC, we need
to do set_cheapest during GetForeignJoinPaths in each subsequent
invocation of that routine, don't we? And set_cheapest is expensive,
isn't it?

2. Save the alternative Plan nodes on fdw_plans or lefttree/righttree
somewhere you like at the GetForeignPlan()
3. Makes BeginForeignScan() to call ExecInitNode() towards the plan node
saved at (2), then save the PlanState on fdw_ps, lefttree/righttree,
or somewhere private area if not displayed on EXPLAIN.
4. Implement ForeignRecheck() routine. If scanrelid==0, it kicks the
planstate node saved at (3) to generate tuple slot. Then, call the
ExecQual() to check qualifiers being pushed down.
5. Makes EndForeignScan() to call ExecEndNode() towards the PlanState
saved at (3).

but the design that you proposed
looks complicated beyond necessity. I think we should add an FDW API
for doing something if FDWs have more knowledge about doing that than
the core, but in your proposal, instead of the core, an FDW has to
eventually do a lot of the core's work: ExecInitNode, ExecProcNode,
ExecQual, ExecEndNode and so on.

It is a trade-off problem between interface flexibility and code smallness
of FDW extension if it fits scope of the core support.
I stand on the viewpoint that gives highest priority on the flexibility,
especially, in case when unpredictable type of modules are expected.
Your proposition is comfortable to FDW on behalf of RDBMS, however, nobody
can promise it is beneficial to FDW on behalf of columnar-store for example.

Maybe I'm missing something, but why do we need such flexibility for
the columnar-stores?

If you stick on the code smallness of FDW on behalf of RDBMS, we can add
utility functions on foreign.c or somewhere. It will be able to provide
equivalent functionality, but FDW can determine whether it use the routines.

That might be an idea, but I'd like to hear the opinions of others.

Best regards,
Etsuro Fujita


#28

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#27)

Re: Foreign join pushdown vs EvalPlanQual

In addition, you may misunderstand the proposition of mine above.
You can check RelOptInfo->fdw_private on top of the GetForeignJoinPaths,
then, if it is second or later invocation, you can check cost of the
alternative path kept in the ForeignPath node previously constructed.
If cheapest_total_path at the moment of GetForeignJoinPaths invocation
is cheaper than the saved alternative path, you can adjust the node to
replace the alternative path node.

To get the (probably unparameterized) cheapest_total_path, IIUC, we need
to do set_cheapest during GetForeignJoinPaths in each subsequent
invocation of that routine, don't we? And set_cheapest is expensive,
isn't it?

add_path() usually drops paths that are obviously inferior to others,
so a walk over join->pathlist should be of reasonable length.
Even if the pathlist has hundreds of items, you CAN implement the
EPQ fallback using alternative built-in logic.
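
A minimal sketch of such a pathlist walk, without calling set_cheapest, might
look like the following. MockPath, cheapest_unparameterized, update_alternative
and demo_saved_cost are hypothetical names, not planner APIs; the model keeps
only a total cost and a "parameterized" flag per path, and it assumes that
only unparameterized paths qualify as the alternative path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a planner Path; only the fields used here. */
typedef struct MockPath
{
    double      total_cost;
    bool        parameterized;  /* assumed unusable as the EPQ alternative */
} MockPath;

/* Walk the path list and return the cheapest unparameterized path, or NULL. */
static const MockPath *
cheapest_unparameterized(const MockPath *paths, size_t npaths)
{
    const MockPath *best = NULL;
    size_t      i;

    for (i = 0; i < npaths; i++)
    {
        if (paths[i].parameterized)
            continue;
        if (best == NULL || paths[i].total_cost < best->total_cost)
            best = &paths[i];
    }
    return best;
}

/* Replace the saved alternative if this invocation offers a cheaper one. */
static void
update_alternative(const MockPath **saved, const MockPath *paths, size_t npaths)
{
    const MockPath *candidate = cheapest_unparameterized(paths, npaths);

    if (candidate != NULL &&
        (*saved == NULL || candidate->total_cost < (*saved)->total_cost))
        *saved = candidate;
}

/* Simulates two invocations for the same joinrel; returns the surviving cost. */
static double
demo_saved_cost(void)
{
    static const MockPath first[] = {{120.0, false}, {80.0, true}};
    static const MockPath second[] = {{120.0, false}, {95.0, false}, {60.0, true}};
    const MockPath *saved = NULL;

    update_alternative(&saved, first, 2);
    update_alternative(&saved, second, 3);
    assert(saved != NULL);
    return saved->total_cost;
}
```

Each invocation costs one linear pass over the list, so the overall cost
depends on how aggressively add_path() has already pruned it.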

2. Save the alternative Plan nodes on fdw_plans or lefttree/righttree
somewhere you like at the GetForeignPlan()
3. Makes BeginForeignScan() to call ExecInitNode() towards the plan node
saved at (2), then save the PlanState on fdw_ps, lefttree/righttree,
or somewhere private area if not displayed on EXPLAIN.
4. Implement ForeignRecheck() routine. If scanrelid==0, it kicks the
planstate node saved at (3) to generate tuple slot. Then, call the
ExecQual() to check qualifiers being pushed down.
5. Makes EndForeignScan() to call ExecEndNode() towards the PlanState
saved at (3).

but the design that you proposed
looks complicated beyond necessity. I think we should add an FDW API
for doing something if FDWs have more knowledge about doing that than
the core, but in your proposal, instead of the core, an FDW has to
eventually do a lot of the core's work: ExecInitNode, ExecProcNode,
ExecQual, ExecEndNode and so on.

It is a trade-off problem between interface flexibility and code smallness
of FDW extension if it fits scope of the core support.
I stand on the viewpoint that gives highest priority on the flexibility,
especially, in case when unpredictable type of modules are expected.
Your proposition is comfortable to FDW on behalf of RDBMS, however, nobody
can promise it is beneficial to FDW on behalf of columnar-store for example.

Maybe I'm missing something, but why do we need such flexibility for
the columnar-stores?

We have various kinds of FDW drivers, and some of their use cases could not
have been predicted in advance. Our community knows of 86 FDW drivers in total;
only 15 of them are for RDBMSs, while 71 are for other data sources.
https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Even if we impose on them a new interface specification that is comfortable for
RDBMSs, we cannot guarantee it is also comfortable for other types of FDW drivers.

If module-X wants to implement the EPQ fallback routine by itself, without an
alternative plan, an overly rich interface design prevents what module-X really
wants to do.

If you stick on the code smallness of FDW on behalf of RDBMS, we can add
utility functions on foreign.c or somewhere. It will be able to provide
equivalent functionality, but FDW can determine whether it use the routines.

That might be an idea, but I'd like to hear the opinions of others.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#29

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#28)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/08/26 16:07, Kouhei Kaigai wrote:
I wrote:

Maybe I'm missing something, but why do we need such flexibility for
the columnar-stores?

Even if we enforce them a new interface specification comfortable to RDBMS,
we cannot guarantee it is also comfortable to other type of FDW drivers.

Specifically, what kind of points about the patch are specific to RDBMS?

If module-X wants to implement the EPQ fallback routine by itself, without
alternative plan, too rich interface design prevents what module-X really
wants to do.

Sorry, I fail to see the need or advantage for module-X to do so in
practice, because I think EPQ testing only executes a subplan for a
*single* set of component test tuples. Maybe I'm missing something, though.

Best regards,
Etsuro Fujita


#30

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#29)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/08/26 16:07, Kouhei Kaigai wrote:
I wrote:

Maybe I'm missing something, but why do we need such flexibility for
the columnar-stores?

Even if we enforce them a new interface specification comfortable to RDBMS,
we cannot guarantee it is also comfortable to other type of FDW drivers.

Specifically, what kind of points about the patch are specific to RDBMS?

  *** 88,93 **** ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  --- 99,122 ----
    TupleTableSlot *
    ExecForeignScan(ForeignScanState *node)
    {
  + 	EState	   *estate = node->ss.ps.state;
  + 
  + 	if (estate->es_epqTuple != NULL)
  + 	{
  + 		/*
  + 		 * We are inside an EvalPlanQual recheck.  If foreign join, get next
  + 		 * tuple from subplan.
  + 		 */
  + 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
  + 
  + 		if (scanrelid == 0)
  + 		{
  + 			PlanState  *outerPlan = outerPlanState(node);
  + 
  + 			return ExecProcNode(outerPlan);
  + 		}
  + 	}
  + 
    	return ExecScan((ScanState *) node,
    					(ExecScanAccessMtd) ForeignNext,
    					(ExecScanRecheckMtd) ForeignRecheck);

It might not be specific to RDBMSs; however, we cannot guarantee that every
FDW is comfortable running the alternative plan node on EPQ recheck.
This design does not allow FDW drivers to implement their own EPQ recheck,
which may be more efficient than the built-in logic.

I do not object to running the alternative plan to implement EPQ recheck when
the FDW driver decides to do so; however, forcing every FDW driver to use an
alternative plan as the solution for EPQ recheck is an unacceptable burden.

If module-X wants to implement the EPQ fallback routine by itself, without
an alternative plan, an overly rich interface design prevents what module-X
really wants to do.

Sorry, I fail to see the need or advantage for module-X to do so in
practice, because I think EPQ testing only executes a subplan for a
*single* set of component test tuples. Maybe I'm missing something, though.

You may think that executing an alternative plan is the best way to do EPQ
rechecks; however, other folks may think their own implementation is the
best. I am not arguing about which approach is better.
What I am pointing out is the freedom/flexibility of implementation choice.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#31

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#30)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/08/26 17:05, Kouhei Kaigai wrote:

On 2015/08/26 16:07, Kouhei Kaigai wrote:

Even if we impose on them a new interface specification that suits RDBMSs,
we cannot guarantee that it also suits other types of FDW drivers.

Specifically, which points of the patch are specific to RDBMSs?

TupleTableSlot *
ExecForeignScan(ForeignScanState *node)
{
+ 	EState	   *estate = node->ss.ps.state;
+
+ 	if (estate->es_epqTuple != NULL)
+ 	{
+ 		/*
+ 		 * We are inside an EvalPlanQual recheck.  If foreign join, get next
+ 		 * tuple from subplan.
+ 		 */
+ 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+
+ 		if (scanrelid == 0)
+ 		{
+ 			PlanState  *outerPlan = outerPlanState(node);
+
+ 			return ExecProcNode(outerPlan);
+ 		}
+ 	}

It might not be specific to RDBMSs; however, we cannot guarantee that every
FDW is comfortable running the alternative plan node on EPQ recheck.
This design does not allow FDW drivers to implement their own EPQ recheck,
which may be more efficient than the built-in logic.

As I said below, EPQ testing only executes a subplan for a *single*
set of component test tuples, so I think the performance gain from an FDW
implementing its own EPQ testing would probably be negligible in
practice. No?
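
[Editorial illustration] The branch structure of the ExecForeignScan hunk quoted above can be modeled in isolation. The following is only a minimal sketch, assuming toy stand-in types: ToyTuple, ToyScanState, and toy_exec_foreign_scan are hypothetical names that do not exist in PostgreSQL, and the real executor structs carry far more state. It shows the single decision under discussion: during an EPQ recheck, a node with scanrelid == 0 (a pushed-down join) hands back the result of its local subplan instead of scanning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the real PostgreSQL executor structs. */
typedef struct ToyTuple
{
    int         a;
} ToyTuple;

typedef struct ToyScanState
{
    unsigned    scanrelid;       /* 0 means the node is a pushed-down join */
    bool        in_epq;          /* true while an EPQ recheck is running */
    ToyTuple   *scan_result;     /* what the ordinary scan path would return */
    ToyTuple   *subplan_result;  /* what the local alternative subplan returns */
} ToyScanState;

/*
 * Mirrors the shape of the quoted hunk: only a foreign join under EPQ is
 * redirected to the subplan.  Every other case falls through to the
 * ordinary scan path (which, in the real executor, does its own EPQ
 * handling for base relations).
 */
static ToyTuple *
toy_exec_foreign_scan(const ToyScanState *node)
{
    assert(node != NULL);

    if (node->in_epq && node->scanrelid == 0)
        return node->subplan_result;

    return node->scan_result;
}
```

In this model a base-relation scan (scanrelid > 0) never reaches the subplan, which matches the patch's comment that the subplan is irrelevant when scanning a base relation.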

If module-X wants to implement the EPQ fallback routine by itself, without
an alternative plan, an overly rich interface design prevents what module-X
really wants to do.

Sorry, I fail to see the need or advantage for module-X to do so in
practice, because I think EPQ testing only executes a subplan for a
*single* set of component test tuples. Maybe I'm missing something, though.

You may think that executing an alternative plan is the best way to do EPQ
rechecks; however, other folks may think their own implementation is the
best. I am not arguing about which approach is better.
What I am pointing out is the freedom/flexibility of implementation choice.

No, I just want to know specifically what the need or advantage of that is.

Best regards,
Etsuro Fujita


#32

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#31)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: Etsuro Fujita [mailto:fujita.etsuro@lab.ntt.co.jp]
Sent: Wednesday, August 26, 2015 5:38 PM
To: Kaigai Kouhei(海外浩平); Robert Haas
Cc: PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/08/26 17:05, Kouhei Kaigai wrote:

On 2015/08/26 16:07, Kouhei Kaigai wrote:

Even if we impose on them a new interface specification that suits RDBMSs,
we cannot guarantee that it also suits other types of FDW drivers.

Specifically, which points of the patch are specific to RDBMSs?

TupleTableSlot *
ExecForeignScan(ForeignScanState *node)
{
+ 	EState	   *estate = node->ss.ps.state;
+
+ 	if (estate->es_epqTuple != NULL)
+ 	{
+ 		/*
+ 		 * We are inside an EvalPlanQual recheck.  If foreign join, get next
+ 		 * tuple from subplan.
+ 		 */
+ 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+
+ 		if (scanrelid == 0)
+ 		{
+ 			PlanState  *outerPlan = outerPlanState(node);
+
+ 			return ExecProcNode(outerPlan);
+ 		}
+ 	}

It might not be specific to RDBMSs; however, we cannot guarantee that every
FDW is comfortable running the alternative plan node on EPQ recheck.
This design does not allow FDW drivers to implement their own EPQ recheck,
which may be more efficient than the built-in logic.

As I said below, EPQ testing only executes a subplan for a *single*
set of component test tuples, so I think the performance gain from an FDW
implementing its own EPQ testing would probably be negligible in
practice. No?

If module-X wants to implement the EPQ fallback routine by itself, without
an alternative plan, an overly rich interface design prevents what module-X
really wants to do.

Sorry, I fail to see the need or advantage for module-X to do so in
practice, because I think EPQ testing only executes a subplan for a
*single* set of component test tuples. Maybe I'm missing something, though.

You may think that executing an alternative plan is the best way to do EPQ
rechecks; however, other folks may think their own implementation is the
best. I am not arguing about which approach is better.
What I am pointing out is the freedom/flexibility of implementation choice.

No, I just want to know specifically what the need or advantage of that is.

I'm not interested in the advantages or disadvantages of an individual FDW
driver's implementation. That is a matter for FDW drivers, not for core
PostgreSQL.

The one significant point I have repeatedly emphasized is that this is the
developer's choice, so it is important to provide options for developers.
If they want, FDW developers can follow the alternative-plan approach for
EPQ rechecks. I do not reject your idea, but it should be one of the
options we can take.
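
[Editorial illustration] The "one of the options" position can be pictured as an optional callback with a built-in default. This is only a sketch under stated assumptions: ToyFdwRoutine, its recheck_join member, and the toy_* functions are hypothetical names invented for this note; they are not an API proposed in this thread or present in PostgreSQL. The default function stands in for running the local alternative plan, and the join qual is reduced to a comparison of two integer keys.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: recheck one pair of component tuples. */
typedef bool (*ToyJoinRecheckFn) (int outer_key, int inner_key);

/* Hypothetical routine table; NOT the real FdwRoutine struct. */
typedef struct ToyFdwRoutine
{
    ToyJoinRecheckFn recheck_join;  /* optional; NULL selects the default */
} ToyFdwRoutine;

/* Counts calls so a caller can observe which path was taken. */
static int  toy_custom_calls = 0;

/* Stand-in for "run the local alternative plan" (the core-provided path). */
static bool
toy_default_recheck(int outer_key, int inner_key)
{
    return outer_key == inner_key;
}

/* Stand-in for an extension's own routine, e.g. a reused fallback path. */
static bool
toy_custom_recheck(int outer_key, int inner_key)
{
    toy_custom_calls++;
    return outer_key == inner_key;
}

/* Use the driver's routine when it supplies one, else the default. */
static bool
toy_epq_recheck(const ToyFdwRoutine *routine, int outer_key, int inner_key)
{
    ToyJoinRecheckFn fn;

    assert(routine != NULL);
    fn = (routine->recheck_join != NULL) ? routine->recheck_join
                                         : toy_default_recheck;
    return fn(outer_key, inner_key);
}
```

A driver that leaves recheck_join as NULL gets the default behavior, while one that sets it (for example to toy_custom_recheck) takes responsibility for its own logic, which is the trade-off being argued over in this exchange.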

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#33

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#32)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/08/26 18:01, Kouhei Kaigai wrote:

You may think that executing an alternative plan is the best way to do EPQ
rechecks; however, other folks may think their own implementation is the
best. I am not arguing about which approach is better.
What I am pointing out is the freedom/flexibility of implementation choice.

Maybe my explanation was not accurate, but I just want to know the use
cases, to understand the need to provide the flexibility.

The one significant point I have repeatedly emphasized is that this is the
developer's choice, so it is important to provide options for developers.
If they want, FDW developers can follow the alternative-plan approach for
EPQ rechecks. I do not reject your idea, but it should be one of the
options we can take.

I don't object to your idea either, but I have a concern about it:
it looks like the more flexibility we provide, the more the FDWs
implementing their own EPQ would be subject to internal changes in the
core.

Best regards,
Etsuro Fujita


#34

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#33)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/08/26 18:01, Kouhei Kaigai wrote:

You may think that executing an alternative plan is the best way to do EPQ
rechecks; however, other folks may think their own implementation is the
best. I am not arguing about which approach is better.
What I am pointing out is the freedom/flexibility of implementation choice.

Maybe my explanation was not accurate, but I just want to know the use
cases, to understand the need to provide the flexibility.

Let's assume the following situation:

Someone wants to implement an FPGA acceleration feature on top of FDW.
(You may know that the earliest PG-Strom was built on the FDW interface.)
It runs SQL join workloads on an FPGA device, but has equivalent
fallback routines to be executed if the FPGA returns an error.
In the EPQ recheck case, it is quite natural that he wants to reuse this
fallback routine to validate the EPQ tuple. An alternative plan may consume
additional (at least nonzero) memory and other system resources.

As I have said repeatedly, this is a software design decision by the author
of the extension. Even if it consumes 100 times more memory and is 1000
times slower, it is his decision and responsibility.
Why does he have to be forced to use a particular logic against his intention?

The one significant point I have repeatedly emphasized is that this is the
developer's choice, so it is important to provide options for developers.
If they want, FDW developers can follow the alternative-plan approach for
EPQ rechecks. I do not reject your idea, but it should be one of the
options we can take.

I don't object to your idea either, but I have a concern about it:
it looks like the more flexibility we provide, the more the FDWs
implementing their own EPQ would be subject to internal changes in the
core.

We never guarantee interface compatibility across major versions. All we
can promise is 'best effort'. So keeping up is always the role of the
extension owner, as long as he continues to maintain his module.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#35

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#34)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/08/27 11:08, Kouhei Kaigai wrote:

On 2015/08/26 18:01, Kouhei Kaigai wrote:

You may think that executing an alternative plan is the best way to do EPQ
rechecks; however, other folks may think their own implementation is the
best. I am not arguing about which approach is better.
What I am pointing out is the freedom/flexibility of implementation choice.

Maybe my explanation was not accurate, but I just want to know the use
cases, to understand the need to provide the flexibility.

Let's assume the following situation:

Someone wants to implement an FPGA acceleration feature on top of FDW.

It runs SQL join workloads on an FPGA device, but has equivalent
fallback routines to be executed if the FPGA returns an error.
In the EPQ recheck case, it is quite natural that he wants to reuse this
fallback routine to validate the EPQ tuple. An alternative plan may consume
additional (at least nonzero) memory and other system resources.

Thanks for the answer, but I'm still not convinced. I think the EPQ
testing shown in that use case would probably not be efficient compared to
the core's.

As I have said repeatedly, this is a software design decision by the author
of the extension. Even if it consumes 100 times more memory and is 1000
times slower, it is his decision and responsibility.
Why does he have to be forced to use a particular logic against his intention?

I don't understand what you proposed, but ISTM that your proposal is
more like a feature, rather than a bugfix. For what you proposed, I
think we should also improve the existing EPQ mechanism, including the
corresponding FDW routines. One possible improvement is the behavior of
late row locking. Currently, we do that by 1) re-fetching each
component tuple from the foreign table after locking it by
RefetchForeignRow and then 2) if necessary, doing an EPQ recheck, i.e.,
re-running the query locally for such component tuples by the core. So,
if we could re-run the join part of the query remotely without
transferring such component tuples from the foreign tables, we would be
able to not only avoid useless data transfer but also improve concurrency
when the join fails.
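
[Editorial illustration] The two-step late-row-locking flow described above can be sketched with toy data. This is a minimal model, assuming invented names: ToyRow, toy_refetch_row, toy_late_lock_recheck, and the toy_rows_transferred counter are all hypothetical, and toy_refetch_row only imitates the role that RefetchForeignRow plays for a real FDW. The join qual is taken to be foo.a = bar.a, as in the example query at the top of the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one remote row; NOT a PostgreSQL tuple. */
typedef struct ToyRow
{
    int         a;
    bool        exists;     /* false if the row has vanished remotely */
} ToyRow;

/* Counts rows copied to the local side, to make the transfer cost visible. */
static int  toy_rows_transferred = 0;

/* Step 1: lock and re-fetch the latest version of one component row. */
static bool
toy_refetch_row(const ToyRow *remote_latest, ToyRow *out)
{
    assert(remote_latest != NULL && out != NULL);

    if (!remote_latest->exists)
        return false;

    *out = *remote_latest;
    toy_rows_transferred++;
    return true;
}

/* Step 2: re-run the join qual (foo.a = bar.a) locally on the fetched rows. */
static bool
toy_late_lock_recheck(const ToyRow *foo_latest, const ToyRow *bar_latest)
{
    ToyRow      foo;
    ToyRow      bar;

    if (!toy_refetch_row(foo_latest, &foo))
        return false;
    if (!toy_refetch_row(bar_latest, &bar))
        return false;

    return foo.a == bar.a;
}
```

Note that in this model both rows are transferred even when the join qual then fails, which is the data transfer that re-running the join remotely would avoid.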

So, how about addressing this issue in two steps: first, work on the
bugfix patch in [1], and then work on what you proposed. The latter
would need more discussion/work, so I think it would be better to take
that in 9.6. If it's OK, I'll update the patch in [1] and add it to the
upcoming CF.

I don't object to your idea either, but I have a concern about it:
it looks like the more flexibility we provide, the more the FDWs
implementing their own EPQ would be subject to internal changes in the
core.

We never guarantee interface compatibility across major versions. All we
can promise is 'best effort'. So keeping up is always the role of the
extension owner, as long as he continues to maintain his module.

I think we cannot 100% guarantee compatibility. That is why I think
we should avoid an FDW improvement that would be subject to internal
changes in the core, unless there is a good reason or use case for it.

Best regards,
Etsuro Fujita

[1]: /messages/by-id/55CB2D45.7040100@lab.ntt.co.jp


#36

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#35)

Re: Foreign join pushdown vs EvalPlanQual

As I have said repeatedly, this is a software design decision by the author
of the extension. Even if it consumes 100 times more memory and is 1000
times slower, it is his decision and responsibility.
Why does he have to be forced to use a particular logic against his intention?

I don't understand what you proposed,

What I'm talking about is the philosophy of software/interface design.
I understand that EPQ recheck by an alternative plan is "one" reasonable way;
however, people often have different ideas, and those may be better than
yours depending on the context, environment, prerequisites, and so on.
It is always unpredictable; only God knows what the best solution is.

In other words, I am not talking about the taste of the restaurant's food;
the problem is the lack of variety on the menu. You may not want it, but we
have the freedom to eat a terrible-tasting meal.

but ISTM that your proposal is
more like a feature, rather than a bugfix.

Yes, the problem we are facing is the lack of a feature. It may have been my
oversight when I designed the join pushdown infrastructure. Sorry.
So it is quite natural to add the missing piece to fix the bug.

For what you proposed, I
think we should also improve the existing EPQ mechanism, including the
corresponding FDW routines. One possible improvement is the behavior of
late row locking. Currently, we do that by 1) re-fetching each
component tuple from the foreign table after locking it by
RefetchForeignRow and then 2) if necessary, doing an EPQ recheck, i.e.,
re-running the query locally for such component tuples by the core. So,
if we could re-run the join part of the query remotely without
transferring such component tuples from the foreign tables, we would be
able to not only avoid useless data transfer but also improve concurrency
when the join fails.

So, how about addressing this issue in two steps: first, work on the
bugfix patch in [1], and then work on what you proposed. The latter
would need more discussion/work, so I think it would be better to take
that in 9.6. If it's OK, I'll update the patch in [1] and add it to the
upcoming CF.

It seems to me too invasive for a bugfix, and it assumes a particular
solution. Please do the rechecking part in the extension, not in the core.

I don't object to your idea either, but I have a concern about it:
it looks like the more flexibility we provide, the more the FDWs
implementing their own EPQ would be subject to internal changes in the
core.

We never guarantee interface compatibility across major versions. All we
can promise is 'best effort'. So keeping up is always the role of the
extension owner, as long as he continues to maintain his module.

I think we cannot 100% guarantee compatibility. That is why I think
we should avoid an FDW improvement that would be subject to internal
changes in the core, unless there is a good reason or use case for it.

That does not make sense unless we provide a stable and well-specified
interface, because developers will have to validate and adjust their
extensions for new major versions anyway.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#37

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#36)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/08/27 16:52, Kouhei Kaigai wrote:
I wrote:

I don't understand what you proposed,

What I'm talking about is the philosophy of software/interface design.
I understand that EPQ recheck by an alternative plan is "one" reasonable way;
however, people often have different ideas, and those may be better than
yours depending on the context, environment, prerequisites, and so on.
It is always unpredictable; only God knows what the best solution is.

In other words, I am not talking about the taste of the restaurant's food;
the problem is the lack of variety on the menu. You may not want it, but we
have the freedom to eat a terrible-tasting meal.

but ISTM that your proposal is
more like a feature, rather than a bugfix.

Yes, the problem we are facing is the lack of a feature. It may have been my
oversight when I designed the join pushdown infrastructure. Sorry.
So it is quite natural to add the missing piece to fix the bug.

For what you proposed, I
think we should also improve the existing EPQ mechanism, including the
corresponding FDW routines. One possible improvement is the behavior of
late row locking. Currently, we do that by 1) re-fetching each
component tuple from the foreign table after locking it by
RefetchForeignRow and then 2) if necessary, doing an EPQ recheck, i.e.,
re-running the query locally for such component tuples by the core. So,
if we could re-run the join part of the query remotely without
transferring such component tuples from the foreign tables, we would be
able to not only avoid useless data transfer but also improve concurrency
when the join fails.

So, how about addressing this issue in two steps: first, work on the
bugfix patch in [1], and then work on what you proposed. The latter
would need more discussion/work, so I think it would be better to take
that in 9.6. If it's OK, I'll update the patch in [1] and add it to the
upcoming CF.

It seems to me too invasive for a bugfix, and it assumes a particular
solution. Please do the rechecking part in the extension, not in the core.

I think we would probably need others' opinions about this issue.

Best regards,
Etsuro Fujita


#38

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#37)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/08/27 17:30, Etsuro Fujita wrote:

I think we would probably need others' opinions about this issue.

Attached is an updated version of the patch [1]. I'd be happy if it
helps people discuss this issue.

Changes:
* Rebased to HEAD.
* Added some more docs and comments.
* Fixed a bug in handling the tlist of a ForeignScan node when the node is
  the top node.
* Fixed a bug in doing ExecAssignScanTypeFromOuterPlan at the top of a
  ForeignScan node.

Best regards,
Etsuro Fujita

[1]: /messages/by-id/55CB2D45.7040100@lab.ntt.co.jp

Attachments:

fdw-eval-plan-qual-1.0.patch (text/x-patch)

*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***************
*** 525,530 **** fileGetForeignPaths(PlannerInfo *root,
--- 525,531 ----
  									 total_cost,
  									 NIL,		/* no pathkeys */
  									 NULL,		/* no outer rel either */
+ 									 NULL,		/* no alternative path */
  									 coptions));
  
  	/*
***************
*** 563,569 **** fileGetForeignPlan(PlannerInfo *root,
  							scan_relid,
  							NIL,	/* no expressions to evaluate */
  							best_path->fdw_private,
! 							NIL /* no custom tlist */ );
  }
  
  /*
--- 564,571 ----
  							scan_relid,
  							NIL,	/* no expressions to evaluate */
  							best_path->fdw_private,
! 							NIL,	/* no custom tlist */
! 							NIL /* no remote quals */ );
  }
  
  /*
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***************
*** 560,565 **** postgresGetForeignPaths(PlannerInfo *root,
--- 560,566 ----
  								   fpinfo->total_cost,
  								   NIL, /* no pathkeys */
  								   NULL,		/* no outer rel either */
+ 								   NULL,		/* no alternative path */
  								   NIL);		/* no fdw_private list */
  	add_path(baserel, (Path *) path);
  
***************
*** 727,732 **** postgresGetForeignPaths(PlannerInfo *root,
--- 728,734 ----
  									   total_cost,
  									   NIL,		/* no pathkeys */
  									   param_info->ppi_req_outer,
+ 									   NULL,	/* no alternative path */
  									   NIL);	/* no fdw_private list */
  		add_path(baserel, (Path *) path);
  	}
***************
*** 748,753 **** postgresGetForeignPlan(PlannerInfo *root,
--- 750,756 ----
  	Index		scan_relid = baserel->relid;
  	List	   *fdw_private;
  	List	   *remote_conds = NIL;
+ 	List	   *remote_exprs = NIL;
  	List	   *local_exprs = NIL;
  	List	   *params_list = NIL;
  	List	   *retrieved_attrs;
***************
*** 769,776 **** postgresGetForeignPlan(PlannerInfo *root,
  	 *
  	 * This code must match "extract_actual_clauses(scan_clauses, false)"
  	 * except for the additional decision about remote versus local execution.
! 	 * Note however that we only strip the RestrictInfo nodes from the
! 	 * local_exprs list, since appendWhereClause expects a list of
  	 * RestrictInfos.
  	 */
  	foreach(lc, scan_clauses)
--- 772,779 ----
  	 *
  	 * This code must match "extract_actual_clauses(scan_clauses, false)"
  	 * except for the additional decision about remote versus local execution.
! 	 * Note however that we don't strip the RestrictInfo nodes from the
! 	 * remote_conds list, since appendWhereClause expects a list of
  	 * RestrictInfos.
  	 */
  	foreach(lc, scan_clauses)
***************
*** 784,794 **** postgresGetForeignPlan(PlannerInfo *root,
--- 787,803 ----
  			continue;
  
  		if (list_member_ptr(fpinfo->remote_conds, rinfo))
+ 		{
  			remote_conds = lappend(remote_conds, rinfo);
+ 			remote_exprs = lappend(remote_exprs, rinfo->clause);
+ 		}
  		else if (list_member_ptr(fpinfo->local_conds, rinfo))
  			local_exprs = lappend(local_exprs, rinfo->clause);
  		else if (is_foreign_expr(root, baserel, rinfo->clause))
+ 		{
  			remote_conds = lappend(remote_conds, rinfo);
+ 			remote_exprs = lappend(remote_exprs, rinfo->clause);
+ 		}
  		else
  			local_exprs = lappend(local_exprs, rinfo->clause);
  	}
***************
*** 874,880 **** postgresGetForeignPlan(PlannerInfo *root,
  							scan_relid,
  							params_list,
  							fdw_private,
! 							NIL /* no custom tlist */ );
  }
  
  /*
--- 883,890 ----
  							scan_relid,
  							params_list,
  							fdw_private,
! 							NIL,	/* no custom tlist */
! 							remote_exprs);
  }
  
  /*
*** a/doc/src/sgml/fdwhandler.sgml
--- b/doc/src/sgml/fdwhandler.sgml
***************
*** 333,339 **** GetForeignJoinPaths (PlannerInfo *root,
       remote join cannot be found from the system catalogs, the FDW must
       fill <structfield>fdw_scan_tlist</> with an appropriate list
       of <structfield>TargetEntry</> nodes, representing the set of columns
!      it will supply at runtime in the tuples it returns.
      </para>
  
      <para>
--- 333,346 ----
       remote join cannot be found from the system catalogs, the FDW must
       fill <structfield>fdw_scan_tlist</> with an appropriate list
       of <structfield>TargetEntry</> nodes, representing the set of columns
!      it will supply at runtime in the tuples it returns.  Yet another
!      difference is that the FDW must provide <structfield>fs_subplan</> with
!      an appropriate plan node involving local joining in preparation for
!      possible use in the <productname>PostgreSQL</productname> executor, while
!      <structfield>fdw_quals</> should be set to NIL, which represents the set
!      of restriction clauses to be enforced remotely in a case when a
!      <structname>ForeignScan</> node is created for a foreign table scan, not
!      a join.
      </para>
  
      <para>
*** a/src/backend/executor/nodeForeignscan.c
--- b/src/backend/executor/nodeForeignscan.c
***************
*** 72,79 **** ForeignNext(ForeignScanState *node)
  static bool
  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  {
! 	/* There are no access-method-specific conditions to recheck. */
! 	return true;
  }
  
  /* ----------------------------------------------------------------
--- 72,90 ----
  static bool
  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  {
! 	ExprContext *econtext;
! 
! 	/*
! 	 * extract necessary information from foreign scan node
! 	 */
! 	econtext = node->ss.ps.ps_ExprContext;
! 
! 	/* Does the tuple meet the remote qual condition? */
! 	econtext->ecxt_scantuple = slot;
! 
! 	ResetExprContext(econtext);
! 
! 	return ExecQual(node->fdw_quals, econtext, false);
  }
  
  /* ----------------------------------------------------------------
***************
*** 88,93 **** ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
--- 99,122 ----
  TupleTableSlot *
  ExecForeignScan(ForeignScanState *node)
  {
+ 	EState	   *estate = node->ss.ps.state;
+ 
+ 	if (estate->es_epqTuple != NULL)
+ 	{
+ 		/*
+ 		 * We are inside an EvalPlanQual recheck.  If foreign join, get next
+ 		 * tuple from subplan.
+ 		 */
+ 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+ 
+ 		if (scanrelid == 0)
+ 		{
+ 			PlanState  *outerPlan = outerPlanState(node);
+ 
+ 			return ExecProcNode(outerPlan);
+ 		}
+ 	}
+ 
  	return ExecScan((ScanState *) node,
  					(ExecScanAccessMtd) ForeignNext,
  					(ExecScanRecheckMtd) ForeignRecheck);
***************
*** 135,140 **** ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
--- 164,172 ----
  	scanstate->ss.ps.qual = (List *)
  		ExecInitExpr((Expr *) node->scan.plan.qual,
  					 (PlanState *) scanstate);
+ 	scanstate->fdw_quals = (List *)
+ 		ExecInitExpr((Expr *) node->fdw_quals,
+ 					 (PlanState *) scanstate);
  
  	/*
  	 * tuple table initialization
***************
*** 195,200 **** ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
--- 227,246 ----
  	 */
  	fdwroutine->BeginForeignScan(scanstate, eflags);
  
+ 	if (estate->es_epqTuple != NULL)
+ 	{
+ 		/*
+ 		 * We are inside an EvalPlanQual recheck.  If foreign join, initialize
+ 		 * subplan.
+ 		 */
+ 		if (scanrelid == 0)
+ 		{
+ 			Plan	   *subplan = node->fs_subplan;
+ 
+ 			outerPlanState(scanstate) = ExecInitNode(subplan, estate, eflags);
+ 		}
+ 	}
+ 
  	return scanstate;
  }
  
***************
*** 207,212 **** ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
--- 253,276 ----
  void
  ExecEndForeignScan(ForeignScanState *node)
  {
+ 	EState	   *estate = node->ss.ps.state;
+ 
+ 	if (estate->es_epqTuple != NULL)
+ 	{
+ 		/*
+ 		 * We are inside an EvalPlanQual recheck.  If foreign join, close down
+ 		 * subplan.
+ 		 */
+ 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+ 
+ 		if (scanrelid == 0)
+ 		{
+ 			PlanState  *outerPlan = outerPlanState(node);
+ 
+ 			ExecEndNode(outerPlan);
+ 		}
+ 	}
+ 
  	/* Let the FDW shut down */
  	node->fdwroutine->EndForeignScan(node);
  
***************
*** 231,236 **** ExecEndForeignScan(ForeignScanState *node)
--- 295,324 ----
  void
  ExecReScanForeignScan(ForeignScanState *node)
  {
+ 	EState	   *estate = node->ss.ps.state;
+ 
+ 	if (estate->es_epqTuple != NULL)
+ 	{
+ 		/*
+ 		 * We are inside an EvalPlanQual recheck.  If foreign join, re-scan
+ 		 * subplan.
+ 		 */
+ 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+ 
+ 		if (scanrelid == 0)
+ 		{
+ 			PlanState  *outerPlan = outerPlanState(node);
+ 
+ 			/*
+ 			 * If outerPlan->chgParam is not null then plan will be
+ 			 * automatically re-scanned by first ExecProcNode.
+ 			 */
+ 			if (outerPlan->chgParam == NULL)
+ 				ExecReScan(outerPlan);
+ 			return;
+ 		}
+ 	}
+ 
  	node->fdwroutine->ReScanForeignScan(node);
  
  	ExecScanReScan(&node->ss);
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 624,629 **** _copyForeignScan(const ForeignScan *from)
--- 624,631 ----
  	COPY_NODE_FIELD(fdw_exprs);
  	COPY_NODE_FIELD(fdw_private);
  	COPY_NODE_FIELD(fdw_scan_tlist);
+ 	COPY_NODE_FIELD(fdw_quals);
+ 	COPY_NODE_FIELD(fs_subplan);
  	COPY_BITMAPSET_FIELD(fs_relids);
  	COPY_SCALAR_FIELD(fsSystemCol);
  
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 579,584 **** _outForeignScan(StringInfo str, const ForeignScan *node)
--- 579,586 ----
  	WRITE_NODE_FIELD(fdw_exprs);
  	WRITE_NODE_FIELD(fdw_private);
  	WRITE_NODE_FIELD(fdw_scan_tlist);
+ 	WRITE_NODE_FIELD(fdw_quals);
+ 	WRITE_NODE_FIELD(fs_subplan);
  	WRITE_BITMAPSET_FIELD(fs_relids);
  	WRITE_BOOL_FIELD(fsSystemCol);
  }
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 2117,2125 **** create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
--- 2117,2134 ----
  			replace_nestloop_params(root, (Node *) scan_plan->scan.plan.qual);
  		scan_plan->fdw_exprs = (List *)
  			replace_nestloop_params(root, (Node *) scan_plan->fdw_exprs);
+ 		scan_plan->fdw_quals = (List *)
+ 			replace_nestloop_params(root, (Node *) scan_plan->fdw_quals);
  	}
  
  	/*
+ 	 * If we're scanning a join relation, generate the local join plan for
+ 	 * EvalPlanQual support.  (Irrelevant if scanning a base relation.)
+ 	 */
+ 	if (scan_relid == 0)
+ 		scan_plan->fs_subplan = create_plan_recurse(root, best_path->subpath);
+ 
+ 	/*
  	 * Detect whether any system columns are requested from rel.  This is a
  	 * bit of a kluge and might go away someday, so we intentionally leave it
  	 * out of the API presented to FDWs.
***************
*** 3702,3708 **** make_foreignscan(List *qptlist,
  				 Index scanrelid,
  				 List *fdw_exprs,
  				 List *fdw_private,
! 				 List *fdw_scan_tlist)
  {
  	ForeignScan *node = makeNode(ForeignScan);
  	Plan	   *plan = &node->scan.plan;
--- 3711,3718 ----
  				 Index scanrelid,
  				 List *fdw_exprs,
  				 List *fdw_private,
! 				 List *fdw_scan_tlist,
! 				 List *fdw_quals)
  {
  	ForeignScan *node = makeNode(ForeignScan);
  	Plan	   *plan = &node->scan.plan;
***************
*** 3718,3723 **** make_foreignscan(List *qptlist,
--- 3728,3736 ----
  	node->fdw_exprs = fdw_exprs;
  	node->fdw_private = fdw_private;
  	node->fdw_scan_tlist = fdw_scan_tlist;
+ 	node->fdw_quals = fdw_quals;
+ 	/* fs_subplan will be filled in by create_foreignscan_plan */
+ 	node->fs_subplan = NULL;
  	/* fs_relids will be filled in by create_foreignscan_plan */
  	node->fs_relids = NULL;
  	/* fsSystemCol will be filled in by create_foreignscan_plan */
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
***************
*** 1124,1139 **** set_foreignscan_references(PlannerInfo *root,
  		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
  		fscan->fdw_scan_tlist =
  			fix_scan_list(root, fscan->fdw_scan_tlist, rtoffset);
  	}
  	else
  	{
! 		/* Adjust tlist, qual, fdw_exprs in the standard way */
  		fscan->scan.plan.targetlist =
  			fix_scan_list(root, fscan->scan.plan.targetlist, rtoffset);
  		fscan->scan.plan.qual =
  			fix_scan_list(root, fscan->scan.plan.qual, rtoffset);
  		fscan->fdw_exprs =
  			fix_scan_list(root, fscan->fdw_exprs, rtoffset);
  	}
  
  	/* Adjust fs_relids if needed */
--- 1124,1143 ----
  		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
  		fscan->fdw_scan_tlist =
  			fix_scan_list(root, fscan->fdw_scan_tlist, rtoffset);
+ 		/* fs_subplan needs set_plan_refs() adjustments */
+ 		set_plan_refs(root, fscan->fs_subplan, rtoffset);
  	}
  	else
  	{
! 		/* Adjust tlist, qual, fdw_exprs, fdw_quals in the standard way */
  		fscan->scan.plan.targetlist =
  			fix_scan_list(root, fscan->scan.plan.targetlist, rtoffset);
  		fscan->scan.plan.qual =
  			fix_scan_list(root, fscan->scan.plan.qual, rtoffset);
  		fscan->fdw_exprs =
  			fix_scan_list(root, fscan->fdw_exprs, rtoffset);
+ 		fscan->fdw_quals =
+ 			fix_scan_list(root, fscan->fdw_quals, rtoffset);
  	}
  
  	/* Adjust fs_relids if needed */
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
***************
*** 2394,2403 **** finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
  			break;
  
  		case T_ForeignScan:
! 			finalize_primnode((Node *) ((ForeignScan *) plan)->fdw_exprs,
! 							  &context);
! 			/* We assume fdw_scan_tlist cannot contain Params */
! 			context.paramids = bms_add_members(context.paramids, scan_params);
  			break;
  
  		case T_CustomScan:
--- 2394,2435 ----
  			break;
  
  		case T_ForeignScan:
! 			{
! 				ForeignScan *fscan = (ForeignScan *) plan;
! 
! 				finalize_primnode((Node *) fscan->fdw_exprs, &context);
! 
! 				/* We assume fdw_scan_tlist cannot contain Params */
! 				context.paramids =
! 					bms_add_members(context.paramids, scan_params);
! 
! 				/*
! 				 * We need not look at fdw_quals, since it will have the same
! 				 * param references as fdw_exprs.
! 				 */
! 
! 				/* subplan node if foreign join */
! 				if (fscan->scan.scanrelid == 0)
! 				{
! 					/*
! 					 * grouping_planner might have replaced the targetlist of
! 					 * the ForeignScan node if the node was the top plan node.
! 					 * To be safe, replace the targetlist of the subplan node.
! 					 */
! 					fscan->fs_subplan->targetlist = plan->targetlist;
! 
! 					/*
! 					 * We need not include params in fs_subplan, since it will
! 					 * have the same param references as the ForeignScan node.
! 					 * However, fs_subplan itself needs finalize_plan()
! 					 * processing.
! 					 */
! 					finalize_plan(root,
! 								  fscan->fs_subplan,
! 								  valid_params,
! 								  scan_params);
! 				}
! 			}
  			break;
  
  		case T_CustomScan:
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***************
*** 1462,1467 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1462,1468 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *subpath,
  						List *fdw_private)
  {
  	ForeignPath *pathnode = makeNode(ForeignPath);
***************
*** 1475,1480 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1476,1482 ----
  	pathnode->path.total_cost = total_cost;
  	pathnode->path.pathkeys = pathkeys;
  
+ 	pathnode->subpath = subpath;
  	pathnode->fdw_private = fdw_private;
  
  	return pathnode;
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1577,1582 **** typedef struct WorkTableScanState
--- 1577,1583 ----
  typedef struct ForeignScanState
  {
  	ScanState	ss;				/* its first field is NodeTag */
+ 	List	   *fdw_quals;		/* remote quals if foreign table */
  	/* use struct pointer to avoid including fdwapi.h here */
  	struct FdwRoutine *fdwroutine;
  	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 521,526 **** typedef struct ForeignScan
--- 521,528 ----
  	List	   *fdw_exprs;		/* expressions that FDW may evaluate */
  	List	   *fdw_private;	/* private data for FDW */
  	List	   *fdw_scan_tlist; /* optional tlist describing scan tuple */
+ 	List	   *fdw_quals;		/* remote quals if foreign table */
+ 	Plan	   *fs_subplan;		/* local join plan if foreign join */
  	Bitmapset  *fs_relids;		/* RTIs generated by this scan */
  	bool		fsSystemCol;	/* true if any "system column" is needed */
  } ForeignScan;
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
***************
*** 897,906 **** typedef struct TidPath
--- 897,914 ----
   * generally a good idea to use a representation that can be dumped by
   * nodeToString(), so that you can examine the structure during debugging
   * with tools like pprint().
+  *
+  * If a ForeignPath node represents a remote join of foreign tables, subpath
+  * is a local join of those tables with equivalent results that will be used
+  * for EvalPlanQual testing.  The pathkeys and parameterization of subpath
+  * must be the same as that of the path's output.  (The requirement for the
+  * pathkeys is unnecessary, since the testing can return at most one tuple
+  * for any particular set of scan tuples of those tables, but let's be safe.)
   */
  typedef struct ForeignPath
  {
  	Path		path;
+ 	Path	   *subpath;
  	List	   *fdw_private;
  } ForeignPath;
  
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
***************
*** 83,88 **** extern ForeignPath *create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 83,89 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *subpath,
  						List *fdw_private);
  
  extern Relids calc_nestloop_required_outer(Path *outer_path, Path *inner_path);
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
***************
*** 45,51 **** extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
  				  Index scanrelid, Plan *subplan);
  extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
  				 Index scanrelid, List *fdw_exprs, List *fdw_private,
! 				 List *fdw_scan_tlist);
  extern Append *make_append(List *appendplans, List *tlist);
  extern RecursiveUnion *make_recursive_union(List *tlist,
  					 Plan *lefttree, Plan *righttree, int wtParam,
--- 45,51 ----
  				  Index scanrelid, Plan *subplan);
  extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
  				 Index scanrelid, List *fdw_exprs, List *fdw_private,
! 				 List *fdw_scan_tlist, List *fdw_quals);
  extern Append *make_append(List *appendplans, List *tlist);
  extern RecursiveUnion *make_recursive_union(List *tlist,
  					 Plan *lefttree, Plan *righttree, int wtParam,

#39

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Tom Lane (#18)

Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On 2015/08/01 23:25, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

The problem that was bothering us (or at least what was bothering me)
is that the PlannerInfo provides only a list of SpecialJoinInfo
structures, which don't directly give you the original join order. In
fact, min_righthand and min_lefthand are intended to constrain the
*possible* join orders, and are deliberately designed *not* to specify
a single join order. If you're sending a query to a remote PostgreSQL
node, you don't want to know what all the possible join orders are;
it's the remote side's job to plan the query. You do, however, need
an easy way to identify one join order that you can use to construct a
query. It didn't seem easy to do that without duplicating
make_join_rel(), which seemed like a bad idea.

In principle it seems like you could traverse root->parse->jointree
as a guide to reconstructing the original syntactic structure; though
I'm not sure how hard it would be to ignore the parts of that tree
that correspond to relations you're not shipping.

I'll investigate this.

But maybe there's a good way to do it. Tom wasn't crazy about this
hook both because of the frequency of calls and also because of the
long argument list. I think those concerns are legitimate; I just
couldn't see how to make the other way work.

In my vision you probably really only want one call per build_join_rel
event (that is, per construction of a new RelOptInfo), not per
make_join_rel event.

It's possible that an FDW that wants to handle joins but is not talking to
a remote query planner would need to grovel through all the join ordering
possibilities individually, and then maybe hooking at make_join_rel is
sensible rather than having to reinvent that logic. But I'd want to see a
concrete use-case first, and I certainly don't think that that's the main
case to design the API around.

I'd vote for hooking at standard_join_search. Here is a use-case:

* When the callback routine is hooked at that function (right after
allpaths.c:1817), an FDW would collect lists of all the available
local-join-path orderings and parameterizations by looking at each path
in rel->pathlist (if the join rel only contains foreign tables that all
belong to the same foreign server).

* Then the FDW would use these as a heuristic to indicate which sort
orderings and parameterizations we should build foreign-join paths for.
(These would be also used as alternative paths for EvalPlanQual
handling, as discussed upthread.) It seems reasonable to me to consider
pushed-down versions of these paths as first candidates, but
foreign-join paths to build are not limited to such ones. The FDW is
allowed to consider any foreign-join paths as long as their alternative
paths are provided.

IMO one thing to consider for the postgres_fdw case would be the
use_remote_estimate option. In the case when the option is true, I
think we should perform remote EXPLAINs for pushed-down-join queries to
obtain cost estimates. But it would require too much time to do that
for each of the possible join rels. So, I think it would be better to
put off the callback routine's work as long as possible. I think that
could probably be done by looking at rel->joininfo,
root->join_info_list and/or something like that. (For example, when
considering a join rel A JOIN B with both on the same foreign server, we
can skip the routine's work if rel->joininfo shows that the join rel is
in turn to be joined with C on the same foreign server.)
Maybe I'm missing something, though.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Etsuro Fujita (#39)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> writes:

On 2015/08/01 23:25, Tom Lane wrote:

In my vision you probably really only want one call per build_join_rel
event (that is, per construction of a new RelOptInfo), not per
make_join_rel event.

I'd vote for hooking at standard_join_search.

I think that method would require the FDW to duplicate a whole lot of the
join search mechanism, for not a whole lot of benefit. It's possible that
there'd be value in doing some initial reconnaissance once we've examined
all the baserels, so I'm not necessarily against providing a hook there.
But if you have in mind that typical FDWs would actually create join paths
at that point, consider that

1. The FDW would have to find all the combinations of its supplied
relations (unless you are only intending to generate one path for the
union of all such rels, which seems pretty narrow-minded from here).

2. The FDW would have to account for join_is_legal considerations.

3. The FDW would have to arrange for creation of joinrel RelOptInfo
structures. While that's possible, the available infrastructure for it
assumes that joinrels are built up from pairs of simpler joinrels, so
you couldn't go directly to the union of all the FDW's rels anyway.

So I still think that the most directly useful infrastructure here
would involve, when build_join_rel() first creates a given joinrel,
noticing whether both sides belong to the same foreign server and
if so giving the FDW a callback to consider producing pushed-down
joins. That would be extremely cheap to do and it would not involve
adding overhead for an FDW to discover what the valid sets of joins
are. In a large join problem, that's *not* going to be a cheap
thing to duplicate. If there are multiple FDWs involved, the idea
that each one of them would do its own join search is particularly
horrid.
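
The "both sides belong to the same foreign server" test described above could be sketched roughly as follows. This is a standalone model, not PostgreSQL code: `MockRel`, `MOCK_INVALID_SERVER` and `join_serverid` are hypothetical names, and the assumption is that each relation carries a server identifier that is zero when the relation is not foreign (or already spans several servers).

```c
#include <stddef.h>

/* Hypothetical marker meaning "not a single-server foreign relation". */
#define MOCK_INVALID_SERVER 0u

/* Hypothetical stand-in for a RelOptInfo; only the server id is modeled. */
typedef struct MockRel
{
    unsigned int serverid;  /* MOCK_INVALID_SERVER if local or mixed */
} MockRel;

/*
 * Decide which server, if any, a new joinrel belongs to.  The join is a
 * pushdown candidate only when both inputs come from the same valid server.
 */
static unsigned int
join_serverid(const MockRel *outer, const MockRel *inner)
{
    if (outer->serverid != MOCK_INVALID_SERVER &&
        outer->serverid == inner->serverid)
        return outer->serverid;

    return MOCK_INVALID_SERVER;
}
```

The check is a couple of integer comparisons per joinrel, which is the sense in which it would be cheap; the FDW callback would be invoked only when the result is a valid server.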

One other problem with the proposal is that we might never call
standard_join_search at all: GEQO overrides it, and so can external
users of join_search_hook.

regards, tom lane


#41

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Tom Lane (#40)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On Wed, Sep 2, 2015 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

But if you have in mind that typical FDWs would actually create join paths
at that point, consider that

1. The FDW would have to find all the combinations of its supplied
relations (unless you are only intending to generate one path for the
union of all such rels, which seems pretty narrow-minded from here).

Well, if the remote end is another database server, presumably we can
leave it to optimize the query, so why would we need more than one
path? I can see that we need more than one path because of sort-order
considerations, which would affect the query we ship to the remote
side. But I don't see the point of considering multiple join orders
unless the remote end is dumber than our optimizer, which might be
true in some cases, but not if the remote end is PostgreSQL.

2. The FDW would have to account for join_is_legal considerations.

I agree with this.

3. The FDW would have to arrange for creation of joinrel RelOptInfo
structures. While that's possible, the available infrastructure for it
assumes that joinrels are built up from pairs of simpler joinrels, so
you couldn't go directly to the union of all the FDW's rels anyway.

And with this.

So I still think that the most directly useful infrastructure here
would involve, when build_join_rel() first creates a given joinrel,
noticing whether both sides belong to the same foreign server and
if so giving the FDW a callback to consider producing pushed-down
joins. That would be extremely cheap to do and it would not involve
adding overhead for an FDW to discover what the valid sets of joins
are. In a large join problem, that's *not* going to be a cheap
thing to duplicate. If there are multiple FDWs involved, the idea
that each one of them would do its own join search is particularly
horrid.

So, the problem is that I don't think this entirely skirts the
join_is_legal issues, which are a principal point of concern for me.
Say this is a joinrel between (A B) and (C D E). We need to generate
an SQL query for (A B C D E). We know that the outermost syntactic
join can be (A B) to (C D E). But how do we know which join orders
are legal as among (C D E)? Maybe there's a simple way to handle this
that I'm not seeing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#42

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Robert Haas (#41)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Sep 2, 2015 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

But if you have in mind that typical FDWs would actually create join paths
at that point, consider that

1. The FDW would have to find all the combinations of its supplied
relations (unless you are only intending to generate one path for the
union of all such rels, which seems pretty narrow-minded from here).

Well, if the remote end is another database server, presumably we can
leave it to optimize the query, so why would we need more than one
path?

If you have say 5 relations in the query, 3 of which are foreign, it might
make sense to join all 3 at the remote end, or maybe you should only join
2 of them remotely because it's better to then join to one of the local
rels before joining the last remote rel. Even if you claim that that
would never make sense from a cost standpoint (a claim easily seen to be
silly), there might not be any legal way to join all 3 directly because of
join order constraints.

The larger point is that we can't expect the remote server to be fully
responsible for optimizing, because it will know nothing of what's being
done on our end.

I can see that we need more than one path because of sort-order
considerations, which would affect the query we ship to the remote
side. But I don't see the point of considering multiple join orders
unless the remote end is dumber than our optimizer, which might be
true in some cases, but not if the remote end is PostgreSQL.

(1) not all remote ends are Postgres, (2) the remote end doesn't have any
access to info about our end.

So, the problem is that I don't think this entirely skirts the
join_is_legal issues, which are a principal point of concern for me.
Say this is a joinrel between (A B) and (C D E). We need to generate
an SQL query for (A B C D E). We know that the outermost syntactic
join can be (A B) to (C D E). But how do we know which join orders
are legal as among (C D E)? Maybe there's a simple way to handle this
that I'm not seeing.

Well, if the joins get built up in the way I think should happen, we'd
have already considered (C D E), and we could have recorded the legal join
orders within that at the time. (I imagine that we should allow FDWs to
store some data within RelOptInfo structs that represent foreign joins
belonging entirely to them, so that there'd be a handy place to keep that
data till later.) Or we could trawl through the paths associated with the
child joinrel, which will presumably include instances of every reasonable
sub-join combination. Or the FDW could look at the SpecialJoinInfo data
and determine things for itself (or more likely, ask join_is_legal about
that).

In the case of postgres_fdw, I think the actual requirement will be to be
able to reconstruct a SQL query that correctly expresses the join; that
is, we need to send over something like "from c left join d on (...) full
join e on (...)", not just "from c, d, e", or we'll get totally bogus
estimates as well as bogus execution results. Offhand I think that the
most likely way to build that text will be to examine the query's jointree
to see where c,d,e appear in it. But in any case, that's a separate issue
and I fail to see how plopping the join search problem into the FDW's lap
would make it any easier.
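
To illustrate the point about preserving join syntax, here is a minimal standalone sketch of walking a join tree and emitting a FROM clause that keeps the join types. It is not postgres_fdw code: `MockJoinNode`, `MockJoinType`, `append_text` and `deparse_join` are hypothetical names, the tree is a simplified stand-in for the query's jointree, and the join conditions are left as the literal placeholder "(...)".

```c
#include <stdio.h>
#include <string.h>

typedef enum { MOCK_JOIN_INNER, MOCK_JOIN_LEFT, MOCK_JOIN_FULL } MockJoinType;

/* Hypothetical join tree node: a leaf names a relation, otherwise it is a join. */
typedef struct MockJoinNode
{
    const char *relname;                /* non-NULL for a base relation */
    MockJoinType jointype;              /* used only when relname is NULL */
    const struct MockJoinNode *left;
    const struct MockJoinNode *right;
} MockJoinNode;

/* Append text to a NUL-terminated buffer without overrunning it. */
static void
append_text(char *buf, size_t buflen, const char *text)
{
    size_t used = strlen(buf);

    if (used < buflen)
        snprintf(buf + used, buflen - used, "%s", text);
}

/* Emit the join syntax for node, keeping LEFT/FULL instead of a comma list. */
static void
deparse_join(const MockJoinNode *node, char *buf, size_t buflen)
{
    if (node->relname != NULL)
    {
        append_text(buf, buflen, node->relname);
        return;
    }

    deparse_join(node->left, buf, buflen);
    append_text(buf, buflen,
                node->jointype == MOCK_JOIN_LEFT ? " left join " :
                node->jointype == MOCK_JOIN_FULL ? " full join " : " join ");

    /* A join nested on the right-hand side needs parentheses. */
    if (node->right->relname == NULL)
        append_text(buf, buflen, "(");
    deparse_join(node->right, buf, buflen);
    if (node->right->relname == NULL)
        append_text(buf, buflen, ")");

    append_text(buf, buflen, " on (...)");
}
```

For a tree representing (c LEFT JOIN d) FULL JOIN e, this yields the same shape as the text quoted above, rather than a flat "c, d, e" list that would lose the outer-join semantics.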

regards, tom lane


#43

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Tom Lane (#42)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

I wrote:

... I imagine that we should allow FDWs to
store some data within RelOptInfo structs that represent foreign joins
belonging entirely to them, so that there'd be a handy place to keep that
data till later.

Actually, if we do that (ie, provide a "void *fdw_state" field in join
RelOptInfos), then the FDW could use the nullness or not-nullness of
such a field to realize whether or not it had already considered this
join relation. So I'm now thinking that the best API is to call the
FDW at the end of each make_join_rel call, whether it's the first one
for the joinrel or not. If the FDW wants a call for each legal pair of
input sub-relations, it's got one. If it only wants one call per joinrel,
it can just make sure to put something into fdw_state, and then on
subsequent calls for the same joinrel it can just exit immediately if
fdw_state is already non-null. So we have both use-cases covered.
Also, by doing this at the end, the FDW can look at the "regular" (local
join execution) paths that were already generated, should it wish to.
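
The once-per-joinrel pattern described above might look roughly like this. It is a standalone sketch, not PostgreSQL code: `MockJoinRel` is a hypothetical stand-in for a join RelOptInfo that models only the proposed `fdw_state` field, and `consider_pushdown` is a hypothetical FDW callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a join RelOptInfo; only the proposed field is modeled. */
typedef struct MockJoinRel
{
    void *fdw_state;    /* NULL until the FDW has considered this joinrel */
} MockJoinRel;

static int fdw_marker;  /* any non-NULL address serves as "already considered" */

/*
 * Hypothetical FDW callback, invoked at the end of every make_join_rel-style
 * event.  Returns 1 if it did the path-building work, 0 if it exited early.
 */
static int
consider_pushdown(MockJoinRel *joinrel)
{
    assert(joinrel != NULL);

    if (joinrel->fdw_state != NULL)
        return 0;               /* already handled on an earlier call */

    /* ... an FDW would build its pushed-down join paths here ... */
    joinrel->fdw_state = &fdw_marker;
    return 1;
}
```

An FDW that wants a call for every legal pair of inputs would simply omit the early return and do its work each time.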

regards, tom lane


#44

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Kouhei Kaigai (#30)

Re: Foreign join pushdown vs EvalPlanQual

On Wed, Aug 26, 2015 at 4:05 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

On 2015/08/26 16:07, Kouhei Kaigai wrote:
I wrote:

Maybe I'm missing something, but why do we need such flexibility for
the columnar-stores?

Even if we impose on them a new interface specification that is comfortable for RDBMS,
we cannot guarantee it is also comfortable for other types of FDW drivers.

Specifically, what kind of points about the patch are specific to RDBMS?
*** 88,93 **** ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
--- 99,122 ----
  TupleTableSlot *
  ExecForeignScan(ForeignScanState *node)
  {
+     EState     *estate = node->ss.ps.state;
+
+     if (estate->es_epqTuple != NULL)
+     {
+             /*
+              * We are inside an EvalPlanQual recheck.  If foreign join, get next
+              * tuple from subplan.
+              */
+             Index           scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+
+             if (scanrelid == 0)
+             {
+                     PlanState  *outerPlan = outerPlanState(node);
+
+                     return ExecProcNode(outerPlan);
+             }
+     }
+
      return ExecScan((ScanState *) node,
                      (ExecScanAccessMtd) ForeignNext,
                      (ExecScanRecheckMtd) ForeignRecheck);

It might not be specific to RDBMS; however, we cannot guarantee that all FDWs are
comfortable running the alternative plan node on EPQ recheck.
This design does not allow FDW drivers to implement their own EPQ recheck, possibly
more efficient than the built-in logic.

I'm not convinced that this problem is more than hypothetical. EPQ
rechecks should be quite rare, so it shouldn't really matter if we
jump through a few extra hoops when they happen. And really, are
those hoops all that expensive? It's not as if ExecInitNode should be
doing any sort of expensive operation, or ExecEndScan either. And
they will be able to tell if they're being called for an EPQ-recheck
by fishing out the estate, so if there's some processing that they
want to short-circuit for that case, they can. So I'm not seeing the
problem. Do you have any evidence that either the performance cost or
the code complexity cost is significant for PG-Strom or any other
extension?

That having been said, I don't entirely like Fujita-san's patch
either. Much of the new code is called immediately adjacent to an FDW
callback which could pretty trivially do the same thing itself, if
needed. And much of it is contingent on whether estate->es_epqTuple
!= NULL and scanrelid == 0, but perhaps it would be better to check
whether the subplan is actually present instead of checking whether we
think it should be present. Also, the naming is a bit weird:
node->fs_subplan gets shoved into outerPlanState(), which seems like a
kludge.

I'm wondering if there's another approach. If I understand correctly,
there are two reasons why the current situation is untenable. The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there. The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything. The
second is that ExecScanFetch needs scanrelid > 0 so that
estate->es_epqTupleSet[scanrelid - 1] isn't indexing off the beginning
of the array, and similarly estate->es_epqScanDone[scanrelid - 1] and
estate->es_epqTuple[scanrelid - 1]. But, waving my hands wildly, that
also seems like a solvable problem. I mean, we're joining a non-empty
set of relations, so the entries in the EPQ-related arrays for those
RTIs are not getting used for anything, so we can use any of them for
the joinrel. We need some way for this code to decide what RTI to
use, but that shouldn't be too hard to finagle.
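
Both ideas can be sketched in a few lines. This is a standalone model under stated assumptions, not executor code: `MockFdwRoutine`, its `recheck_cb` member, `MockSlot`, `mock_foreign_recheck` and `pick_epq_rti` are hypothetical names, and choosing the lowest member RTI as the representative for the joinrel is only one possible rule, picked here for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a tuple slot. */
typedef struct MockSlot
{
    int value;
} MockSlot;

typedef bool (*MockRecheckCallback) (const MockSlot *slot);

/* Hypothetical FDW routine table; the recheck callback is optional. */
typedef struct MockFdwRoutine
{
    MockRecheckCallback recheck_cb;     /* NULL for FDWs that do not supply one */
} MockFdwRoutine;

/* Recheck: default to true when the FDW supplies no callback. */
static bool
mock_foreign_recheck(const MockFdwRoutine *routine, const MockSlot *slot)
{
    if (routine->recheck_cb == NULL)
        return true;
    return routine->recheck_cb(slot);
}

/*
 * Pick a representative RTI (1-based) for a joinrel from a bitmask of its
 * member RTIs, where bit (rti - 1) is set for each member.  Assumption:
 * use the lowest member.  Returns 0 for an empty set, so callers can avoid
 * indexing an array at [rti - 1] with rti == 0.
 */
static int
pick_epq_rti(unsigned int relids)
{
    int rti;

    for (rti = 1; rti <= 32; rti++)
    {
        if (relids & (1u << (rti - 1)))
            return rti;
    }
    return 0;
}

/* Example callback: accept only slots with a positive value. */
static bool
recheck_positive(const MockSlot *slot)
{
    return slot->value > 0;
}
```

Existing FDWs would leave the callback unset and keep today's behavior, while a join-capable FDW could set it and use the representative RTI's entries in the EPQ arrays.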

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#45

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#44)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/03 9:41, Robert Haas wrote:

That having been said, I don't entirely like Fujita-san's patch
either. Much of the new code is called immediately adjacent to an FDW
callback which could pretty trivially do the same thing itself, if
needed.

Another idea about that code is to call it in, e.g., ExecProcNode,
instead of calling ExecForeignScan there. I think that that might be
much cleaner and resolve the naming problem below.

And much of it is contingent on whether estate->es_epqTuple
!= NULL and scanrelid == 0, but perhaps it would be better to check
whether the subplan is actually present instead of checking whether we
think it should be present.

Agreed with this.

Also, the naming is a bit weird:
node->fs_subplan gets shoved into outerPlanState(), which seems like a
kludge.

And with this. Proposals welcome.

I'm wondering if there's another approach. If I understand correctly,
there are two reasons why the current situation is untenable. The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there. The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.

My question about this is, is the callback really needed? If there are
any FDWs that want to do the work *in their own way*, instead of just
doing ExecProcNode for executing a local join execution plan in case of
foreign join (or just doing ExecQual for checking remote quals in case
of foreign table), I'd agree with introducing the callback, but if not,
I don't think that that makes much sense.

Best regards,
Etsuro Fujita


#46

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#45)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/03 14:22, Etsuro Fujita wrote:

On 2015/09/03 9:41, Robert Haas wrote:

That having been said, I don't entirely like Fujita-san's patch
either. Much of the new code is called immediately adjacent to an FDW
callback which could pretty trivially do the same thing itself, if
needed.

Another idea about that code is to call it in, e.g., ExecProcNode,
instead of calling ExecForeignScan there. I think that that might be
much cleaner and resolve the naming problem below.

I gave it another thought; the following changes to ExecInitNode would
make the patch much simpler, i.e., we would no longer need to call the new
code in ExecInitForeignScan, ExecForeignScan, ExecEndForeignScan, and
ExecReScanForeignScan. I think that would resolve the naming problem also.

*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 247,254 **** ExecInitNode(Plan *node, EState *estate, int eflags)
             break;

         case T_ForeignScan:
!            result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
!                                                       estate, eflags);
             break;

         case T_CustomScan:
--- 247,269 ----
             break;

         case T_ForeignScan:
!            {
!                Index       scanrelid = ((ForeignScan *) node)->scan.scanrelid;
!
!                if (estate->es_epqTuple != NULL && scanrelid == 0)
!                {
!                    /*
!                     * We are in foreign join inside an EvalPlanQual recheck.
!                     * Initialize local join execution plan, instead.
!                     */
!                    Plan       *subplan = ((ForeignScan *) node)->fs_subplan;
!
!                    result = ExecInitNode(subplan, estate, eflags);
!                }
!                else
!                    result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
!                                                               estate, eflags);
!            }
             break;

Best regards,
Etsuro Fujita


#47

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Robert Haas (#44)

Re: Foreign join pushdown vs EvalPlanQual

On Wed, Aug 26, 2015 at 4:05 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
On 2015/08/26 16:07, Kouhei Kaigai wrote:
I wrote:

Maybe I'm missing something, but why do we need such flexibility for
the columnar-stores?

Even if we impose on them a new interface specification that is comfortable for RDBMS,
we cannot guarantee it is also comfortable for other types of FDW drivers.

Specifically, what kind of points about the patch are specific to RDBMS?
*** 88,93 **** ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
--- 99,122 ----
  TupleTableSlot *
  ExecForeignScan(ForeignScanState *node)
  {
+     EState     *estate = node->ss.ps.state;
+
+     if (estate->es_epqTuple != NULL)
+     {
+             /*
+              * We are inside an EvalPlanQual recheck.  If foreign join, get next
+              * tuple from subplan.
+              */
+             Index           scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+
+             if (scanrelid == 0)
+             {
+                     PlanState  *outerPlan = outerPlanState(node);
+
+                     return ExecProcNode(outerPlan);
+             }
+     }
+
      return ExecScan((ScanState *) node,
                      (ExecScanAccessMtd) ForeignNext,
                      (ExecScanRecheckMtd) ForeignRecheck);

It might not be specific to RDBMS; however, we cannot guarantee that all FDWs are
comfortable running the alternative plan node on EPQ recheck.
This design does not allow FDW drivers to implement their own EPQ recheck, possibly
more efficient than the built-in logic.

I'm not convinced that this problem is more than hypothetical. EPQ
rechecks should be quite rare, so it shouldn't really matter if we
jump through a few extra hoops when they happen. And really, are
those hoops all that expensive? It's not as if ExecInitNode should be
doing any sort of expensive operation, or ExecEndScan either. And
they will be able to tell if they're being called for an EPQ-recheck
by fishing out the estate, so if there's some processing that they
want to short-circuit for that case, they can. So I'm not seeing the
problem. Do you have any evidence that either the performance cost or
the code complexity cost is significant for PG-Strom or any other
extension?

Even though PG-Strom does not implement an EPQ recheck mechanism yet
(and is not implemented on top of FDW), I plan to reuse its CPU fallback
mechanism (*1) rather than take the alternative plan approach.
I don't care about the performance penalty either; however, I don't want to
have an alternative plan because of the code complexity.
I don't object to individual extensions choosing to have an alternative
path, but it should not be enforced.

(*1) The GPU often cannot evaluate an expression because of exceptional
data, such as a very long numeric or externally TOASTed value, even though
the expression itself is executable. In this case, PG-Strom evaluates the
expression on the CPU side (of course, this is worse than the normal
execution path, but better than an error). This logic is almost the same
as what we need for the EPQ recheck.
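
The fallback idea in (*1) can be sketched roughly as follows. This is only an illustration and not PG-Strom code: fast_eval, cpu_eval, eval_with_fallback, and the MAX_FAST_LEN limit are hypothetical names invented here, and a string length check stands in for "data the accelerated path cannot handle".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical limit: the "fast" evaluator refuses inputs longer than this. */
#define MAX_FAST_LEN 8

/*
 * Hypothetical accelerated path.  Returns false when it cannot handle the
 * input (standing in for a very long numeric or externally stored datum);
 * otherwise stores the predicate result in *result and returns true.
 */
static bool
fast_eval(const char *digits, bool *result)
{
    if (strlen(digits) > MAX_FAST_LEN)
        return false;           /* cannot evaluate here; caller falls back */
    *result = (digits[0] != '0');
    return true;
}

/* Hypothetical slow path: always able to evaluate the same predicate. */
static bool
cpu_eval(const char *digits)
{
    return digits[0] != '0';
}

/*
 * Try the fast path first and fall back to the CPU path when the fast path
 * reports that it cannot handle the input.  *used_fallback records which
 * path produced the answer.
 */
static bool
eval_with_fallback(const char *digits, bool *used_fallback)
{
    bool        result;

    assert(digits != NULL && used_fallback != NULL);

    if (fast_eval(digits, &result))
    {
        *used_fallback = false;
        return result;
    }
    *used_fallback = true;
    return cpu_eval(digits);
}
```

Calling eval_with_fallback("12345678901234567890", &fb) takes the CPU path because the input exceeds MAX_FAST_LEN, while a short input such as "42" stays on the fast path. The point of the sketch is only that the slow path evaluates the same predicate, so the caller never sees an error.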

I'm wondering if there's another approach. If I understand correctly,
there are two reasons why the current situation is untenable. The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there. The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything. The
second is that ExecScanFetch needs scanrelid > 0 so that
estate->es_epqTupleSet[scanrelid - 1] isn't indexing off the beginning
of the array, and similarly estate->es_epqScanDone[scanrelid - 1] and
estate->es_epqTuple[scanrelid - 1]. But, waving my hands wildly, that
also seems like a solvable problem. I mean, we're joining a non-empty
set of relations, so the entries in the EPQ-related arrays for those
RTIs are not getting used for anything, so we can use any of them for
the joinrel. We need some way for this code to decide what RTI to
use, but that shouldn't be too hard to finagle.
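
The two ideas in the paragraph above (an optional recheck callback, and borrowing one of the join's RTIs for the EPQ arrays) might look roughly like the following sketch. Everything here is a mock: MockFdwRoutine, mock_foreign_recheck, and mock_pick_epq_rti are hypothetical names, a plain bitmask stands in for a relids set, and picking the lowest RTI is merely one assumed policy, not something settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical recheck callback type; the int stands in for a tuple. */
typedef bool (*RecheckCallback) (int tuple_value);

/* Hypothetical stand-in for an FDW routine table with an optional callback. */
typedef struct MockFdwRoutine
{
    RecheckCallback recheck;    /* NULL for FDWs that do not provide one */
} MockFdwRoutine;

/*
 * Return true when no callback is supplied (so existing FDWs need no
 * changes); otherwise let the FDW decide.
 */
static bool
mock_foreign_recheck(const MockFdwRoutine *routine, int tuple_value)
{
    assert(routine != NULL);
    if (routine->recheck == NULL)
        return true;
    return routine->recheck(tuple_value);
}

/*
 * Pick an RTI to represent a join relation.  Bit i of relids set means
 * RTI i takes part in the join (RTIs are 1-based, so bit 0 is unused).
 * Returns the lowest member, or 0 if the set is empty.
 */
static unsigned
mock_pick_epq_rti(unsigned long relids)
{
    unsigned    rti;

    for (rti = 1; rti < sizeof(relids) * 8; rti++)
    {
        if (relids & (1UL << rti))
            return rti;
    }
    return 0;
}

/* Example callback: accept only positive values. */
static bool
accept_positive(int tuple_value)
{
    return tuple_value > 0;
}
```

With a routine whose recheck field is NULL, mock_foreign_recheck accepts every tuple, which mirrors the current always-true behaviour. With accept_positive installed, the decision moves to the callback. For a join over RTIs 2 and 3, mock_pick_epq_rti returns 2, so index 2 - 1 would be the array slot reused for the joinrel under this assumed policy.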

ForeignScan->fs_relids and CustomScan->custom_relids know which RTIs
are involved in this joinrel.

However, only the extension knows how these relations (including the case
of an N-way join) are to be joined. FDW drivers may keep joinrestrictinfo
in whatever form suits them, such as a compiled GPU native binary, so I
don't think the core side can reasonably do anything relevant.
Fujita-san proposed new special fields in ForeignScan to attach the
expression nodes that were pushed down; however, it looks to me like that
makes the interface contract more complicated. Rather than adding various
special-purpose fields, it is more straightforward to call back the
extension when scanrelid == 0. We can provide an equivalent feature as
a utility function that has the capability Fujita-san wants.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#48

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Tom Lane (#42)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On Wed, Sep 2, 2015 at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Sep 2, 2015 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

But if you have in mind that typical FDWs would actually create join paths
at that point, consider that

1. The FDW would have to find all the combinations of its supplied
relations (unless you are only intending to generate one path for the
union of all such rels, which seems pretty narrow-minded from here).

Well, if the remote end is another database server, presumably we can
leave it to optimize the query, so why would we need more than one
path?

If you have say 5 relations in the query, 3 of which are foreign, it might
make sense to join all 3 at the remote end, or maybe you should only join
2 of them remotely because it's better to then join to one of the local
rels before joining the last remote rel.

True. But that's not the problem I'm concerned about. Suppose the
query looks like this:

SELECT * FROM ft1 LEFT JOIN ft2 ON ft1.x = ft2.x LEFT JOIN t1 ON ft2.y = t1.y
LEFT JOIN ft3 ON ft1.z = ft3.z LEFT JOIN t2 ON ft1.w = t2.w;

Now, no matter where we put the hooks, we'll consider foreign join
paths for all of the various combinations of tables that we could push
down. We'll decide between those various options based on cost, which
is fine. But let's consider just one joinrel, the one that includes
(ft1 ft2 ft3), and assume that the remote tables have the same names as
the local tables. The path that implements a pushed-down join of all
three tables will send one of these two queries to the remote server:

SELECT * FROM ft1 LEFT JOIN ft2 ON ft1.x = ft2.x LEFT JOIN ft3 ON ft1.z = ft3.z;
SELECT * FROM ft1 LEFT JOIN ft3 ON ft1.z = ft3.z LEFT JOIN ft2 ON ft1.x = ft2.x;

We need to generate one of those two queries, and we need to figure
out what the remote server thinks it will cost to execute. We
presumably do not need to cost both of them, because if it's legal to
commute the joins, the remote server can and will do that itself. It
would be stupid to cost both possible queries if the remote server is
going to pick the same plan either way. However - and this is the key
point - the one we choose to generate *must represent a legal join
order*. If the ft1-ft2 join were a FULL JOIN instead of a LEFT JOIN,
the second query wouldn't be a legal thing to send to the remote
server. So, the problem I'm worried about is: given that we know we
want to at least consider the path that pushes the whole join to the
remote server, how do we construct an SQL query that embodies a legal
join order of the relations being pushed down?

Even if you claim that that
would never make sense from a cost standpoint (a claim easily seen to be
silly), there might not be any legal way to join all 3 directly because of
join order constraints.

The larger point is that we can't expect the remote server to be fully
responsible for optimizing, because it will know nothing of what's being
done on our end.

No argument with any of that.

So, the problem is that I don't think this entirely skirts the
join_is_legal issues, which are a principal point of concern for me.
Say this is a joinrel between (A B) and (C D E). We need to generate
an SQL query for (A B C D E). We know that the outermost syntactic
join can be (A B) to (C D E). But how do we know which join orders
are legal as among (C D E)? Maybe there's a simple way to handle this
that I'm not seeing.

Well, if the joins get built up in the way I think should happen, we'd
have already considered (C D E), and we could have recorded the legal join
orders within that at the time. (I imagine that we should allow FDWs to
store some data within RelOptInfo structs that represent foreign joins
belonging entirely to them, so that there'd be a handy place to keep that
data till later.)

Yes, that would help. Can fdw_private serve that purpose, or do we
need something else?

Or we could trawl through the paths associated with the
child joinrel, which will presumably include instances of every reasonable
sub-join combination. Or the FDW could look at the SpecialJoinInfo data
and determine things for itself (or more likely, ask join_is_legal about
that).

Yeah, this is the part I'm worried will be complex, which accounts for
the current hook placement. I'm worried that trawling through that
SpecialJoinInfo data will end up needing to duplicate much of
make_join_rel and add_paths_to_joinrel. For example, consider:

SELECT * FROM verysmall v JOIN (bigft1 FULL JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r;

The best path for this plan is presumably something like this:

Nested Loop
-> Seq Scan on verysmall v
-> Foreign Scan on bigft1 and bigft2
Remote SQL: SELECT * FROM bigft1 FULL JOIN bigft2 ON bigft1.x =
bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Now, how is the FDW going to figure out that it needs to generate this
parameterized path without duplicating this code from
add_paths_to_joinrel?

/*
* Decide whether it's sensible to generate parameterized paths for this
* joinrel, and if so, which relations such paths should require. There
* is usually no need to create a parameterized result path unless there
...

Maybe there's a very simple answer to this question and I'm just not
seeing it, but I really don't see how that's going to work.

In the case of postgres_fdw, I think the actual requirement will be to be
able to reconstruct a SQL query that correctly expresses the join; that
is, we need to send over something like "from c left join d on (...) full
join e on (...)", not just "from c, d, e", or we'll get totally bogus
estimates as well as bogus execution results.

Agreed.

Offhand I think that the
most likely way to build that text will be to examine the query's jointree
to see where c,d,e appear in it. But in any case, that's a separate issue
and I fail to see how plopping the join search problem into the FDW's lap
would make it any easier.
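
As a rough illustration of building the FROM text by walking a join tree, so that the join types and nesting survive instead of collapsing to "from c, d, e", consider the sketch below. It is not postgres_fdw code: MockJoinNode, mock_append, and mock_deparse_from are hypothetical, the ON conditions are carried as plain strings, and the real query jointree is considerably richer than this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum MockJoinType
{
    MOCK_INNER,
    MOCK_LEFT,
    MOCK_FULL
} MockJoinType;

/* Hypothetical join tree: a leaf has relname set; a join has left/right. */
typedef struct MockJoinNode
{
    const char *relname;        /* non-NULL for a base relation */
    MockJoinType jointype;      /* used only for join nodes */
    const struct MockJoinNode *left;
    const struct MockJoinNode *right;
    const char *on_clause;      /* join condition text for join nodes */
} MockJoinNode;

static const char *
mock_join_keyword(MockJoinType t)
{
    switch (t)
    {
        case MOCK_LEFT:
            return "LEFT JOIN";
        case MOCK_FULL:
            return "FULL JOIN";
        default:
            return "JOIN";
    }
}

/* Append s to the NUL-terminated buf; return -1 instead of overflowing. */
static int
mock_append(char *buf, size_t buflen, const char *s)
{
    size_t      used = strlen(buf);

    if (used + strlen(s) + 1 > buflen)
        return -1;
    memcpy(buf + used, s, strlen(s) + 1);
    return 0;
}

/*
 * Recursively emit "(left JOINTYPE right ON (cond))" so that the nesting
 * of the tree is preserved in the generated text.  Returns 0 on success,
 * -1 if buf is too small.
 */
static int
mock_deparse_from(const MockJoinNode *n, char *buf, size_t buflen)
{
    assert(n != NULL);

    if (n->relname != NULL)
        return mock_append(buf, buflen, n->relname);

    if (mock_append(buf, buflen, "(") != 0 ||
        mock_deparse_from(n->left, buf, buflen) != 0 ||
        mock_append(buf, buflen, " ") != 0 ||
        mock_append(buf, buflen, mock_join_keyword(n->jointype)) != 0 ||
        mock_append(buf, buflen, " ") != 0 ||
        mock_deparse_from(n->right, buf, buflen) != 0 ||
        mock_append(buf, buflen, " ON (") != 0 ||
        mock_append(buf, buflen, n->on_clause) != 0 ||
        mock_append(buf, buflen, "))") != 0)
        return -1;
    return 0;
}
```

For a tree in which (c LEFT JOIN d) is the left input of a FULL JOIN with e, the function emits a fully parenthesized join expression, so the remote side sees the same join types and nesting as the local tree. The sketch deliberately says nothing about which nestings are legal; that is the join_is_legal question discussed in this thread.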

Yeah, I am not advocating for putting the hook in
standard_join_search. I'm explaining why I put it in
add_paths_to_joinrel instead of, as I believe you were advocating, in
make_join_rel prior to the big switch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#49

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Robert Haas (#48)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Sep 2, 2015 at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Offhand I think that the
most likely way to build that text will be to examine the query's jointree
to see where c,d,e appear in it. But in any case, that's a separate issue
and I fail to see how plopping the join search problem into the FDW's lap
would make it any easier.

Yeah, I am not advocating for putting the hook in
standard_join_search. I'm explaining why I put it in
add_paths_to_joinrel instead of, as I believe you were advocating, in
make_join_rel prior to the big switch.

If you had a solution to the how-to-build-the-query-text problem,
and it depended on that hook placement, then your argument might
make some sense. As is, you've entirely failed to convince me
that this placement is not wrong, wasteful, and likely to create
unnecessary API breaks for FDWs.

(Also, per my last message on the subject, *after* the switch
is what I think makes sense.)

regards, tom lane


#50

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#46)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/03 19:25, Etsuro Fujita wrote:

On 2015/09/03 14:22, Etsuro Fujita wrote:

On 2015/09/03 9:41, Robert Haas wrote:

That having been said, I don't entirely like Fujita-san's patch
either. Much of the new code is called immediately adjacent to an FDW
callback which could pretty trivially do the same thing itself, if
needed.

Another idea is to call that code from, e.g., ExecProcNode, instead of
calling ExecForeignScan there. I think that might be much cleaner and
would resolve the naming problem below.

I gave it another thought; the following changes to ExecInitNode would
make the patch much simpler, i.e., we would no longer need to call the new
code in ExecInitForeignScan, ExecForeignScan, ExecEndForeignScan, and
ExecReScanForeignScan. I think that would also resolve the naming problem.

I'm attaching an updated version of the patch. The patch is based on
the SS_finalize_plan patch that has been recently committed. I'd be
happy if this helps people discuss more about how to fix this issue.

Best regards,
Etsuro Fujita

Attachments:

fdw-eval-plan-qual-2.0.patch (text/x-patch)

*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***************
*** 525,530 **** fileGetForeignPaths(PlannerInfo *root,
--- 525,531 ----
  									 total_cost,
  									 NIL,		/* no pathkeys */
  									 NULL,		/* no outer rel either */
+ 									 NULL,		/* no alternative path */
  									 coptions));
  
  	/*
***************
*** 563,569 **** fileGetForeignPlan(PlannerInfo *root,
  							scan_relid,
  							NIL,	/* no expressions to evaluate */
  							best_path->fdw_private,
! 							NIL /* no custom tlist */ );
  }
  
  /*
--- 564,571 ----
  							scan_relid,
  							NIL,	/* no expressions to evaluate */
  							best_path->fdw_private,
! 							NIL,	/* no custom tlist */
! 							NIL /* no remote quals */ );
  }
  
  /*
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***************
*** 560,565 **** postgresGetForeignPaths(PlannerInfo *root,
--- 560,566 ----
  								   fpinfo->total_cost,
  								   NIL, /* no pathkeys */
  								   NULL,		/* no outer rel either */
+ 								   NULL,		/* no alternative path */
  								   NIL);		/* no fdw_private list */
  	add_path(baserel, (Path *) path);
  
***************
*** 727,732 **** postgresGetForeignPaths(PlannerInfo *root,
--- 728,734 ----
  									   total_cost,
  									   NIL,		/* no pathkeys */
  									   param_info->ppi_req_outer,
+ 									   NULL,	/* no alternative path */
  									   NIL);	/* no fdw_private list */
  		add_path(baserel, (Path *) path);
  	}
***************
*** 748,753 **** postgresGetForeignPlan(PlannerInfo *root,
--- 750,756 ----
  	Index		scan_relid = baserel->relid;
  	List	   *fdw_private;
  	List	   *remote_conds = NIL;
+ 	List	   *remote_exprs = NIL;
  	List	   *local_exprs = NIL;
  	List	   *params_list = NIL;
  	List	   *retrieved_attrs;
***************
*** 769,776 **** postgresGetForeignPlan(PlannerInfo *root,
  	 *
  	 * This code must match "extract_actual_clauses(scan_clauses, false)"
  	 * except for the additional decision about remote versus local execution.
! 	 * Note however that we only strip the RestrictInfo nodes from the
! 	 * local_exprs list, since appendWhereClause expects a list of
  	 * RestrictInfos.
  	 */
  	foreach(lc, scan_clauses)
--- 772,779 ----
  	 *
  	 * This code must match "extract_actual_clauses(scan_clauses, false)"
  	 * except for the additional decision about remote versus local execution.
! 	 * Note however that we don't strip the RestrictInfo nodes from the
! 	 * remote_conds list, since appendWhereClause expects a list of
  	 * RestrictInfos.
  	 */
  	foreach(lc, scan_clauses)
***************
*** 784,794 **** postgresGetForeignPlan(PlannerInfo *root,
--- 787,803 ----
  			continue;
  
  		if (list_member_ptr(fpinfo->remote_conds, rinfo))
+ 		{
  			remote_conds = lappend(remote_conds, rinfo);
+ 			remote_exprs = lappend(remote_exprs, rinfo->clause);
+ 		}
  		else if (list_member_ptr(fpinfo->local_conds, rinfo))
  			local_exprs = lappend(local_exprs, rinfo->clause);
  		else if (is_foreign_expr(root, baserel, rinfo->clause))
+ 		{
  			remote_conds = lappend(remote_conds, rinfo);
+ 			remote_exprs = lappend(remote_exprs, rinfo->clause);
+ 		}
  		else
  			local_exprs = lappend(local_exprs, rinfo->clause);
  	}
***************
*** 874,880 **** postgresGetForeignPlan(PlannerInfo *root,
  							scan_relid,
  							params_list,
  							fdw_private,
! 							NIL /* no custom tlist */ );
  }
  
  /*
--- 883,890 ----
  							scan_relid,
  							params_list,
  							fdw_private,
! 							NIL,	/* no custom tlist */
! 							remote_exprs);
  }
  
  /*
*** a/doc/src/sgml/fdwhandler.sgml
--- b/doc/src/sgml/fdwhandler.sgml
***************
*** 333,339 **** GetForeignJoinPaths (PlannerInfo *root,
       remote join cannot be found from the system catalogs, the FDW must
       fill <structfield>fdw_scan_tlist</> with an appropriate list
       of <structfield>TargetEntry</> nodes, representing the set of columns
!      it will supply at runtime in the tuples it returns.
      </para>
  
      <para>
--- 333,343 ----
       remote join cannot be found from the system catalogs, the FDW must
       fill <structfield>fdw_scan_tlist</> with an appropriate list
       of <structfield>TargetEntry</> nodes, representing the set of columns
!      it will supply at runtime in the tuples it returns.  Yet another
!      difference is that, for possible use in the
!      <productname>PostgreSQL</productname> executor, the FDW must fill
!      <structfield>fs_subplan</> with a local join plan equivalent to
!      the resulting <structname>ForeignScan</> plan.
      </para>
  
      <para>
***************
*** 1111,1117 **** GetForeignServerByName(const char *name, bool missing_ok);
       clauses will be checked by the executor at run time.  More complex FDWs
       may be able to check some of the clauses internally, in which case those
       clauses can be removed from the plan node's qual list so that the
!      executor doesn't waste time rechecking them.
      </para>
  
      <para>
--- 1115,1124 ----
       clauses will be checked by the executor at run time.  More complex FDWs
       may be able to check some of the clauses internally, in which case those
       clauses can be removed from the plan node's qual list so that the
!      executor doesn't waste time rechecking them.  In the case of planning
!      foreign table scans, removed clauses must be added to the
!      <structfield>fdw_quals</> list of the <structname>ForeignScan</> node
!      for possible use in the <productname>PostgreSQL</productname> executor.
      </para>
  
      <para>
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 247,254 **** ExecInitNode(Plan *node, EState *estate, int eflags)
  			break;
  
  		case T_ForeignScan:
! 			result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
! 													   estate, eflags);
  			break;
  
  		case T_CustomScan:
--- 247,274 ----
  			break;
  
  		case T_ForeignScan:
! 			{
! 				Index		scanrelid = ((ForeignScan *) node)->scan.scanrelid;
! 
! 				if (estate->es_epqTuple != NULL && scanrelid == 0)
! 				{
! 					/*
! 					 * We are in foreign join for EvalPlanQual testing.
! 					 * Initialize the local join plan, instead.
! 					 *
! 					 * Note: initPlans attached to this PlanState node below
! 					 * will be in line with those attached to the plan.  See
! 					 * SS_finalize_plan().
! 					 */
! 					Plan	   *subplan = ((ForeignScan *) node)->fs_subplan;
! 
! 					Assert(subplan != NULL);
! 					result = ExecInitNode(subplan, estate, eflags);
! 				}
! 				else
! 					result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
! 															   estate, eflags);
! 			}
  			break;
  
  		case T_CustomScan:
*** a/src/backend/executor/nodeForeignscan.c
--- b/src/backend/executor/nodeForeignscan.c
***************
*** 72,79 **** ForeignNext(ForeignScanState *node)
  static bool
  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  {
! 	/* There are no access-method-specific conditions to recheck. */
! 	return true;
  }
  
  /* ----------------------------------------------------------------
--- 72,90 ----
  static bool
  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  {
! 	ExprContext *econtext;
! 
! 	/*
! 	 * extract necessary information from foreign scan node
! 	 */
! 	econtext = node->ss.ps.ps_ExprContext;
! 
! 	/* Does the tuple meet the remote qual condition? */
! 	econtext->ecxt_scantuple = slot;
! 
! 	ResetExprContext(econtext);
! 
! 	return ExecQual(node->fdw_quals, econtext, false);
  }
  
  /* ----------------------------------------------------------------
***************
*** 135,140 **** ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
--- 146,154 ----
  	scanstate->ss.ps.qual = (List *)
  		ExecInitExpr((Expr *) node->scan.plan.qual,
  					 (PlanState *) scanstate);
+ 	scanstate->fdw_quals = (List *)
+ 		ExecInitExpr((Expr *) node->fdw_quals,
+ 					 (PlanState *) scanstate);
  
  	/*
  	 * tuple table initialization
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 624,629 **** _copyForeignScan(const ForeignScan *from)
--- 624,631 ----
  	COPY_NODE_FIELD(fdw_exprs);
  	COPY_NODE_FIELD(fdw_private);
  	COPY_NODE_FIELD(fdw_scan_tlist);
+ 	COPY_NODE_FIELD(fdw_quals);
+ 	COPY_NODE_FIELD(fs_subplan);
  	COPY_BITMAPSET_FIELD(fs_relids);
  	COPY_SCALAR_FIELD(fsSystemCol);
  
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 579,584 **** _outForeignScan(StringInfo str, const ForeignScan *node)
--- 579,586 ----
  	WRITE_NODE_FIELD(fdw_exprs);
  	WRITE_NODE_FIELD(fdw_private);
  	WRITE_NODE_FIELD(fdw_scan_tlist);
+ 	WRITE_NODE_FIELD(fdw_quals);
+ 	WRITE_NODE_FIELD(fs_subplan);
  	WRITE_BITMAPSET_FIELD(fs_relids);
  	WRITE_BOOL_FIELD(fsSystemCol);
  }
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 2117,2125 **** create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
--- 2117,2134 ----
  			replace_nestloop_params(root, (Node *) scan_plan->scan.plan.qual);
  		scan_plan->fdw_exprs = (List *)
  			replace_nestloop_params(root, (Node *) scan_plan->fdw_exprs);
+ 		scan_plan->fdw_quals = (List *)
+ 			replace_nestloop_params(root, (Node *) scan_plan->fdw_quals);
  	}
  
  	/*
+ 	 * If we're scanning a join relation, generate the local join plan for
+ 	 * EvalPlanQual support.  (Irrelevant if scanning a base relation.)
+ 	 */
+ 	if (scan_relid == 0)
+ 		scan_plan->fs_subplan = create_plan_recurse(root, best_path->subpath);
+ 
+ 	/*
  	 * Detect whether any system columns are requested from rel.  This is a
  	 * bit of a kluge and might go away someday, so we intentionally leave it
  	 * out of the API presented to FDWs.
***************
*** 3702,3708 **** make_foreignscan(List *qptlist,
  				 Index scanrelid,
  				 List *fdw_exprs,
  				 List *fdw_private,
! 				 List *fdw_scan_tlist)
  {
  	ForeignScan *node = makeNode(ForeignScan);
  	Plan	   *plan = &node->scan.plan;
--- 3711,3718 ----
  				 Index scanrelid,
  				 List *fdw_exprs,
  				 List *fdw_private,
! 				 List *fdw_scan_tlist,
! 				 List *fdw_quals)
  {
  	ForeignScan *node = makeNode(ForeignScan);
  	Plan	   *plan = &node->scan.plan;
***************
*** 3718,3723 **** make_foreignscan(List *qptlist,
--- 3728,3736 ----
  	node->fdw_exprs = fdw_exprs;
  	node->fdw_private = fdw_private;
  	node->fdw_scan_tlist = fdw_scan_tlist;
+ 	node->fdw_quals = fdw_quals;
+ 	/* fs_subplan will be filled in by create_foreignscan_plan */
+ 	node->fs_subplan = NULL;
  	/* fs_relids will be filled in by create_foreignscan_plan */
  	node->fs_relids = NULL;
  	/* fsSystemCol will be filled in by create_foreignscan_plan */
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
***************
*** 1124,1139 **** set_foreignscan_references(PlannerInfo *root,
  		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
  		fscan->fdw_scan_tlist =
  			fix_scan_list(root, fscan->fdw_scan_tlist, rtoffset);
  	}
  	else
  	{
! 		/* Adjust tlist, qual, fdw_exprs in the standard way */
  		fscan->scan.plan.targetlist =
  			fix_scan_list(root, fscan->scan.plan.targetlist, rtoffset);
  		fscan->scan.plan.qual =
  			fix_scan_list(root, fscan->scan.plan.qual, rtoffset);
  		fscan->fdw_exprs =
  			fix_scan_list(root, fscan->fdw_exprs, rtoffset);
  	}
  
  	/* Adjust fs_relids if needed */
--- 1124,1143 ----
  		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
  		fscan->fdw_scan_tlist =
  			fix_scan_list(root, fscan->fdw_scan_tlist, rtoffset);
+ 		/* fs_subplan needs set_plan_refs() adjustments */
+ 		set_plan_refs(root, fscan->fs_subplan, rtoffset);
  	}
  	else
  	{
! 		/* Adjust tlist, qual, fdw_exprs, fdw_quals in the standard way */
  		fscan->scan.plan.targetlist =
  			fix_scan_list(root, fscan->scan.plan.targetlist, rtoffset);
  		fscan->scan.plan.qual =
  			fix_scan_list(root, fscan->scan.plan.qual, rtoffset);
  		fscan->fdw_exprs =
  			fix_scan_list(root, fscan->fdw_exprs, rtoffset);
+ 		fscan->fdw_quals =
+ 			fix_scan_list(root, fscan->fdw_quals, rtoffset);
  	}
  
  	/* Adjust fs_relids if needed */
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
***************
*** 2394,2403 **** finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
  			break;
  
  		case T_ForeignScan:
! 			finalize_primnode((Node *) ((ForeignScan *) plan)->fdw_exprs,
! 							  &context);
! 			/* We assume fdw_scan_tlist cannot contain Params */
! 			context.paramids = bms_add_members(context.paramids, scan_params);
  			break;
  
  		case T_CustomScan:
--- 2394,2444 ----
  			break;
  
  		case T_ForeignScan:
! 			{
! 				ForeignScan *fscan = (ForeignScan *) plan;
! 
! 				finalize_primnode((Node *) fscan->fdw_exprs, &context);
! 
! 				/* We assume fdw_scan_tlist cannot contain Params */
! 				context.paramids =
! 					bms_add_members(context.paramids, scan_params);
! 
! 				/*
! 				 * We need not look at fdw_quals, since it will have the same
! 				 * param references as fdw_exprs.
! 				 */
! 
! 				/* subplan node if foreign join */
! 				if (fscan->scan.scanrelid == 0)
! 				{
! 					/*
! 					 * If the ForeignScan node was the topmost scan/join plan
! 					 * node, grouping_planner() might have replaced the tlist
! 					 * of the ForeignScan node.  So, replace the tlist of the
! 					 * subplan with that of the ForeignScan node.
! 					 */
! 					fscan->fs_subplan->targetlist = plan->targetlist;
! 
! 					/*
! 					 * If the ForeignScan node was the topmost plan node for
! 					 * the query level, it might have initplans.  To match
! 					 * the computed extParam/allParam sets for the subplan
! 					 * with those for the ForeignScan node accurately, set
! 					 * the initPlans of the subplan to the ForeignScan node's
! 					 * initPlans.  Not sure this is needed.
! 					 */
! 					fscan->fs_subplan->initPlan = plan->initPlan;
! 
! 					/*
! 					 * We need not include the subplan's params.  However,
! 					 * the subplan itself needs finalize_plan() processing.
! 					 */
! 					finalize_plan(root,
! 								  fscan->fs_subplan,
! 								  valid_params,
! 								  scan_params);
! 				}
! 			}
  			break;
  
  		case T_CustomScan:
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***************
*** 1462,1467 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1462,1468 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *subpath,
  						List *fdw_private)
  {
  	ForeignPath *pathnode = makeNode(ForeignPath);
***************
*** 1475,1480 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1476,1482 ----
  	pathnode->path.total_cost = total_cost;
  	pathnode->path.pathkeys = pathkeys;
  
+ 	pathnode->subpath = subpath;
  	pathnode->fdw_private = fdw_private;
  
  	return pathnode;
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1577,1582 **** typedef struct WorkTableScanState
--- 1577,1583 ----
  typedef struct ForeignScanState
  {
  	ScanState	ss;				/* its first field is NodeTag */
+ 	List	   *fdw_quals;		/* remote quals if foreign table */
  	/* use struct pointer to avoid including fdwapi.h here */
  	struct FdwRoutine *fdwroutine;
  	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 521,526 **** typedef struct ForeignScan
--- 521,528 ----
  	List	   *fdw_exprs;		/* expressions that FDW may evaluate */
  	List	   *fdw_private;	/* private data for FDW */
  	List	   *fdw_scan_tlist; /* optional tlist describing scan tuple */
+ 	List	   *fdw_quals;		/* remote quals if foreign table */
+ 	Plan	   *fs_subplan;		/* local join plan if foreign join */
  	Bitmapset  *fs_relids;		/* RTIs generated by this scan */
  	bool		fsSystemCol;	/* true if any "system column" is needed */
  } ForeignScan;
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
***************
*** 897,906 **** typedef struct TidPath
--- 897,914 ----
   * generally a good idea to use a representation that can be dumped by
   * nodeToString(), so that you can examine the structure during debugging
   * with tools like pprint().
+  *
+  * If a ForeignPath node represents a remote join of foreign tables, subpath
+  * is a local join of those tables with equivalent results that will be used
+  * for EvalPlanQual testing.  The pathkeys and parameterization of subpath
+  * must be the same as that of the path's output.  (The requirement for the
+  * pathkeys is unnecessary, since the testing can return at most one tuple
+  * for any particular set of scan tuples of those tables, but let's be safe.)
   */
  typedef struct ForeignPath
  {
  	Path		path;
+ 	Path	   *subpath;
  	List	   *fdw_private;
  } ForeignPath;
  
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
***************
*** 83,88 **** extern ForeignPath *create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 83,89 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *subpath,
  						List *fdw_private);
  
  extern Relids calc_nestloop_required_outer(Path *outer_path, Path *inner_path);
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
***************
*** 45,51 **** extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
  				  Index scanrelid, Plan *subplan);
  extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
  				 Index scanrelid, List *fdw_exprs, List *fdw_private,
! 				 List *fdw_scan_tlist);
  extern Append *make_append(List *appendplans, List *tlist);
  extern RecursiveUnion *make_recursive_union(List *tlist,
  					 Plan *lefttree, Plan *righttree, int wtParam,
--- 45,51 ----
  				  Index scanrelid, Plan *subplan);
  extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
  				 Index scanrelid, List *fdw_exprs, List *fdw_private,
! 				 List *fdw_scan_tlist, List *fdw_quals);
  extern Append *make_append(List *appendplans, List *tlist);
  extern RecursiveUnion *make_recursive_union(List *tlist,
  					 Plan *lefttree, Plan *righttree, int wtParam,

#51

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#50)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/04 19:50, Etsuro Fujita wrote:

I'm attaching an updated version of the patch. The patch is based on
the SS_finalize_plan patch that has been recently committed. I'd be
happy if this helps people discuss more about how to fix this issue.

In the updated version, I modified finalize_plan so that initPlans
attached to a ForeignScan node doing a remote join are taken into account
when computing the params for the local join plan used for EvalPlanQual
testing. But I noticed that this is unnecessary: no initPlans will be
attached to such a ForeignScan node, because it cannot be the topmost plan
node for the query level when EvalPlanQual testing is involved. So I
removed that code. Patch attached. (It no longer depends on the
SS_finalize_plan patch.)

Best regards,
Etsuro Fujita

Attachments:

fdw-eval-plan-qual-3.0.patch (text/x-patch)

*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***************
*** 525,530 **** fileGetForeignPaths(PlannerInfo *root,
--- 525,531 ----
  									 total_cost,
  									 NIL,		/* no pathkeys */
  									 NULL,		/* no outer rel either */
+ 									 NULL,		/* no alternative path */
  									 coptions));
  
  	/*
***************
*** 563,569 **** fileGetForeignPlan(PlannerInfo *root,
  							scan_relid,
  							NIL,	/* no expressions to evaluate */
  							best_path->fdw_private,
! 							NIL /* no custom tlist */ );
  }
  
  /*
--- 564,571 ----
  							scan_relid,
  							NIL,	/* no expressions to evaluate */
  							best_path->fdw_private,
! 							NIL,	/* no custom tlist */
! 							NIL /* no remote quals */ );
  }
  
  /*
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***************
*** 560,565 **** postgresGetForeignPaths(PlannerInfo *root,
--- 560,566 ----
  								   fpinfo->total_cost,
  								   NIL, /* no pathkeys */
  								   NULL,		/* no outer rel either */
+ 								   NULL,		/* no alternative path */
  								   NIL);		/* no fdw_private list */
  	add_path(baserel, (Path *) path);
  
***************
*** 727,732 **** postgresGetForeignPaths(PlannerInfo *root,
--- 728,734 ----
  									   total_cost,
  									   NIL,		/* no pathkeys */
  									   param_info->ppi_req_outer,
+ 									   NULL,	/* no alternative path */
  									   NIL);	/* no fdw_private list */
  		add_path(baserel, (Path *) path);
  	}
***************
*** 748,753 **** postgresGetForeignPlan(PlannerInfo *root,
--- 750,756 ----
  	Index		scan_relid = baserel->relid;
  	List	   *fdw_private;
  	List	   *remote_conds = NIL;
+ 	List	   *remote_exprs = NIL;
  	List	   *local_exprs = NIL;
  	List	   *params_list = NIL;
  	List	   *retrieved_attrs;
***************
*** 769,776 **** postgresGetForeignPlan(PlannerInfo *root,
  	 *
  	 * This code must match "extract_actual_clauses(scan_clauses, false)"
  	 * except for the additional decision about remote versus local execution.
! 	 * Note however that we only strip the RestrictInfo nodes from the
! 	 * local_exprs list, since appendWhereClause expects a list of
  	 * RestrictInfos.
  	 */
  	foreach(lc, scan_clauses)
--- 772,779 ----
  	 *
  	 * This code must match "extract_actual_clauses(scan_clauses, false)"
  	 * except for the additional decision about remote versus local execution.
! 	 * Note however that we don't strip the RestrictInfo nodes from the
! 	 * remote_conds list, since appendWhereClause expects a list of
  	 * RestrictInfos.
  	 */
  	foreach(lc, scan_clauses)
***************
*** 784,794 **** postgresGetForeignPlan(PlannerInfo *root,
--- 787,803 ----
  			continue;
  
  		if (list_member_ptr(fpinfo->remote_conds, rinfo))
+ 		{
  			remote_conds = lappend(remote_conds, rinfo);
+ 			remote_exprs = lappend(remote_exprs, rinfo->clause);
+ 		}
  		else if (list_member_ptr(fpinfo->local_conds, rinfo))
  			local_exprs = lappend(local_exprs, rinfo->clause);
  		else if (is_foreign_expr(root, baserel, rinfo->clause))
+ 		{
  			remote_conds = lappend(remote_conds, rinfo);
+ 			remote_exprs = lappend(remote_exprs, rinfo->clause);
+ 		}
  		else
  			local_exprs = lappend(local_exprs, rinfo->clause);
  	}
***************
*** 874,880 **** postgresGetForeignPlan(PlannerInfo *root,
  							scan_relid,
  							params_list,
  							fdw_private,
! 							NIL /* no custom tlist */ );
  }
  
  /*
--- 883,890 ----
  							scan_relid,
  							params_list,
  							fdw_private,
! 							NIL,	/* no custom tlist */
! 							remote_exprs);
  }
  
  /*
*** a/doc/src/sgml/fdwhandler.sgml
--- b/doc/src/sgml/fdwhandler.sgml
***************
*** 333,339 **** GetForeignJoinPaths (PlannerInfo *root,
       remote join cannot be found from the system catalogs, the FDW must
       fill <structfield>fdw_scan_tlist</> with an appropriate list
       of <structfield>TargetEntry</> nodes, representing the set of columns
!      it will supply at runtime in the tuples it returns.
      </para>
  
      <para>
--- 333,343 ----
       remote join cannot be found from the system catalogs, the FDW must
       fill <structfield>fdw_scan_tlist</> with an appropriate list
       of <structfield>TargetEntry</> nodes, representing the set of columns
!      it will supply at runtime in the tuples it returns.  Yet another
!      difference is that, for possible use in the
!      <productname>PostgreSQL</productname> executor, the FDW must fill
!      <structfield>fs_subplan</> with a local join plan equivalent to
!      the resulting <structname>ForeignScan</> plan.
      </para>
  
      <para>
***************
*** 1111,1117 **** GetForeignServerByName(const char *name, bool missing_ok);
       clauses will be checked by the executor at run time.  More complex FDWs
       may be able to check some of the clauses internally, in which case those
       clauses can be removed from the plan node's qual list so that the
!      executor doesn't waste time rechecking them.
      </para>
  
      <para>
--- 1115,1124 ----
       clauses will be checked by the executor at run time.  More complex FDWs
       may be able to check some of the clauses internally, in which case those
       clauses can be removed from the plan node's qual list so that the
!      executor doesn't waste time rechecking them.  In the case of planning
!      foreign table scans, removed clauses must be added to the
!      <structfield>fdw_quals</> list of the <structname>ForeignScan</> node
!      for possible use in the <productname>PostgreSQL</productname> executor.
      </para>
  
      <para>
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 247,254 **** ExecInitNode(Plan *node, EState *estate, int eflags)
  			break;
  
  		case T_ForeignScan:
! 			result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
! 													   estate, eflags);
  			break;
  
  		case T_CustomScan:
--- 247,270 ----
  			break;
  
  		case T_ForeignScan:
! 			{
! 				Index		scanrelid = ((ForeignScan *) node)->scan.scanrelid;
! 
! 				if (estate->es_epqTuple != NULL && scanrelid == 0)
! 				{
! 					/*
! 					 * We are in foreign join for EvalPlanQual testing.
! 					 * Initialize the local join plan, instead.
! 					 */
! 					Plan	   *subplan = ((ForeignScan *) node)->fs_subplan;
! 
! 					Assert(subplan != NULL);
! 					result = ExecInitNode(subplan, estate, eflags);
! 				}
! 				else
! 					result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
! 															   estate, eflags);
! 			}
  			break;
  
  		case T_CustomScan:
*** a/src/backend/executor/nodeForeignscan.c
--- b/src/backend/executor/nodeForeignscan.c
***************
*** 72,79 **** ForeignNext(ForeignScanState *node)
  static bool
  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  {
! 	/* There are no access-method-specific conditions to recheck. */
! 	return true;
  }
  
  /* ----------------------------------------------------------------
--- 72,90 ----
  static bool
  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  {
! 	ExprContext *econtext;
! 
! 	/*
! 	 * extract necessary information from foreign scan node
! 	 */
! 	econtext = node->ss.ps.ps_ExprContext;
! 
! 	/* Does the tuple meet the remote qual condition? */
! 	econtext->ecxt_scantuple = slot;
! 
! 	ResetExprContext(econtext);
! 
! 	return ExecQual(node->fdw_quals, econtext, false);
  }
  
  /* ----------------------------------------------------------------
***************
*** 135,140 **** ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
--- 146,154 ----
  	scanstate->ss.ps.qual = (List *)
  		ExecInitExpr((Expr *) node->scan.plan.qual,
  					 (PlanState *) scanstate);
+ 	scanstate->fdw_quals = (List *)
+ 		ExecInitExpr((Expr *) node->fdw_quals,
+ 					 (PlanState *) scanstate);
  
  	/*
  	 * tuple table initialization
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 624,629 **** _copyForeignScan(const ForeignScan *from)
--- 624,631 ----
  	COPY_NODE_FIELD(fdw_exprs);
  	COPY_NODE_FIELD(fdw_private);
  	COPY_NODE_FIELD(fdw_scan_tlist);
+ 	COPY_NODE_FIELD(fdw_quals);
+ 	COPY_NODE_FIELD(fs_subplan);
  	COPY_BITMAPSET_FIELD(fs_relids);
  	COPY_SCALAR_FIELD(fsSystemCol);
  
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 579,584 **** _outForeignScan(StringInfo str, const ForeignScan *node)
--- 579,586 ----
  	WRITE_NODE_FIELD(fdw_exprs);
  	WRITE_NODE_FIELD(fdw_private);
  	WRITE_NODE_FIELD(fdw_scan_tlist);
+ 	WRITE_NODE_FIELD(fdw_quals);
+ 	WRITE_NODE_FIELD(fs_subplan);
  	WRITE_BITMAPSET_FIELD(fs_relids);
  	WRITE_BOOL_FIELD(fsSystemCol);
  }
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 2117,2125 **** create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
--- 2117,2134 ----
  			replace_nestloop_params(root, (Node *) scan_plan->scan.plan.qual);
  		scan_plan->fdw_exprs = (List *)
  			replace_nestloop_params(root, (Node *) scan_plan->fdw_exprs);
+ 		scan_plan->fdw_quals = (List *)
+ 			replace_nestloop_params(root, (Node *) scan_plan->fdw_quals);
  	}
  
  	/*
+ 	 * If we're scanning a join relation, generate the local join plan for
+ 	 * EvalPlanQual support.  (Irrelevant if scanning a base relation.)
+ 	 */
+ 	if (scan_relid == 0)
+ 		scan_plan->fs_subplan = create_plan_recurse(root, best_path->subpath);
+ 
+ 	/*
  	 * Detect whether any system columns are requested from rel.  This is a
  	 * bit of a kluge and might go away someday, so we intentionally leave it
  	 * out of the API presented to FDWs.
***************
*** 3702,3708 **** make_foreignscan(List *qptlist,
  				 Index scanrelid,
  				 List *fdw_exprs,
  				 List *fdw_private,
! 				 List *fdw_scan_tlist)
  {
  	ForeignScan *node = makeNode(ForeignScan);
  	Plan	   *plan = &node->scan.plan;
--- 3711,3718 ----
  				 Index scanrelid,
  				 List *fdw_exprs,
  				 List *fdw_private,
! 				 List *fdw_scan_tlist,
! 				 List *fdw_quals)
  {
  	ForeignScan *node = makeNode(ForeignScan);
  	Plan	   *plan = &node->scan.plan;
***************
*** 3718,3723 **** make_foreignscan(List *qptlist,
--- 3728,3736 ----
  	node->fdw_exprs = fdw_exprs;
  	node->fdw_private = fdw_private;
  	node->fdw_scan_tlist = fdw_scan_tlist;
+ 	node->fdw_quals = fdw_quals;
+ 	/* fs_subplan will be filled in by create_foreignscan_plan */
+ 	node->fs_subplan = NULL;
  	/* fs_relids will be filled in by create_foreignscan_plan */
  	node->fs_relids = NULL;
  	/* fsSystemCol will be filled in by create_foreignscan_plan */
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
***************
*** 1124,1139 **** set_foreignscan_references(PlannerInfo *root,
  		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
  		fscan->fdw_scan_tlist =
  			fix_scan_list(root, fscan->fdw_scan_tlist, rtoffset);
  	}
  	else
  	{
! 		/* Adjust tlist, qual, fdw_exprs in the standard way */
  		fscan->scan.plan.targetlist =
  			fix_scan_list(root, fscan->scan.plan.targetlist, rtoffset);
  		fscan->scan.plan.qual =
  			fix_scan_list(root, fscan->scan.plan.qual, rtoffset);
  		fscan->fdw_exprs =
  			fix_scan_list(root, fscan->fdw_exprs, rtoffset);
  	}
  
  	/* Adjust fs_relids if needed */
--- 1124,1143 ----
  		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
  		fscan->fdw_scan_tlist =
  			fix_scan_list(root, fscan->fdw_scan_tlist, rtoffset);
+ 		/* fs_subplan needs set_plan_refs() adjustments */
+ 		set_plan_refs(root, fscan->fs_subplan, rtoffset);
  	}
  	else
  	{
! 		/* Adjust tlist, qual, fdw_exprs, fdw_quals in the standard way */
  		fscan->scan.plan.targetlist =
  			fix_scan_list(root, fscan->scan.plan.targetlist, rtoffset);
  		fscan->scan.plan.qual =
  			fix_scan_list(root, fscan->scan.plan.qual, rtoffset);
  		fscan->fdw_exprs =
  			fix_scan_list(root, fscan->fdw_exprs, rtoffset);
+ 		fscan->fdw_quals =
+ 			fix_scan_list(root, fscan->fdw_quals, rtoffset);
  	}
  
  	/* Adjust fs_relids if needed */
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
***************
*** 2394,2403 **** finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
  			break;
  
  		case T_ForeignScan:
! 			finalize_primnode((Node *) ((ForeignScan *) plan)->fdw_exprs,
! 							  &context);
! 			/* We assume fdw_scan_tlist cannot contain Params */
! 			context.paramids = bms_add_members(context.paramids, scan_params);
  			break;
  
  		case T_CustomScan:
--- 2394,2445 ----
  			break;
  
  		case T_ForeignScan:
! 			{
! 				ForeignScan *fscan = (ForeignScan *) plan;
! 
! 				finalize_primnode((Node *) fscan->fdw_exprs, &context);
! 
! 				/* We assume fdw_scan_tlist cannot contain Params */
! 				context.paramids =
! 					bms_add_members(context.paramids, scan_params);
! 
! 				/*
! 				 * We need not look at fdw_quals, since it will have the same
! 				 * param references as fdw_exprs.
! 				 */
! 
! 				/* subplan node if foreign join */
! 				if (fscan->scan.scanrelid == 0)
! 				{
! 					/*
! 					 * If the ForeignScan node was the topmost scan/join plan
! 					 * node, grouping_planner() might have replaced the tlist
! 					 * of the ForeignScan node.  So, replace the tlist of the
! 					 * subplan with that of the ForeignScan node.
! 					 */
! 					fscan->fs_subplan->targetlist = plan->targetlist;
! 
! 					/*
! 					 * If the ForeignScan node was the topmost plan node for
! 					 * the query level, SS_attach_initplans() might have
! 					 * attached initPlans to it.  In case of EvalPlanQual
! 					 * testing, however, it will not happen because a
! 					 * ForeignScan node is unable to be the topmost node for
! 					 * the query level in that case.  So no need to consider
! 					 * the ForeignScan's initPlans for the computed
! 					 * extParam/allParam sets for the subplan.
! 					 */
! 
! 					/*
! 					 * We need not include the subplan's params.  However,
! 					 * the subplan itself needs finalize_plan() processing.
! 					 */
! 					finalize_plan(root,
! 								  fscan->fs_subplan,
! 								  valid_params,
! 								  scan_params);
! 				}
! 			}
  			break;
  
  		case T_CustomScan:
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***************
*** 1462,1467 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1462,1468 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *subpath,
  						List *fdw_private)
  {
  	ForeignPath *pathnode = makeNode(ForeignPath);
***************
*** 1475,1480 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1476,1482 ----
  	pathnode->path.total_cost = total_cost;
  	pathnode->path.pathkeys = pathkeys;
  
+ 	pathnode->subpath = subpath;
  	pathnode->fdw_private = fdw_private;
  
  	return pathnode;
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1577,1582 **** typedef struct WorkTableScanState
--- 1577,1583 ----
  typedef struct ForeignScanState
  {
  	ScanState	ss;				/* its first field is NodeTag */
+ 	List	   *fdw_quals;		/* remote quals if foreign table */
  	/* use struct pointer to avoid including fdwapi.h here */
  	struct FdwRoutine *fdwroutine;
  	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 521,526 **** typedef struct ForeignScan
--- 521,528 ----
  	List	   *fdw_exprs;		/* expressions that FDW may evaluate */
  	List	   *fdw_private;	/* private data for FDW */
  	List	   *fdw_scan_tlist; /* optional tlist describing scan tuple */
+ 	List	   *fdw_quals;		/* remote quals if foreign table */
+ 	Plan	   *fs_subplan;		/* local join plan if foreign join */
  	Bitmapset  *fs_relids;		/* RTIs generated by this scan */
  	bool		fsSystemCol;	/* true if any "system column" is needed */
  } ForeignScan;
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
***************
*** 897,906 **** typedef struct TidPath
--- 897,914 ----
   * generally a good idea to use a representation that can be dumped by
   * nodeToString(), so that you can examine the structure during debugging
   * with tools like pprint().
+  *
+  * If a ForeignPath node represents a remote join of foreign tables, subpath
+  * is a local join of those tables with equivalent results that will be used
+  * for EvalPlanQual testing.  The pathkeys and parameterization of subpath
+  * must be the same as that of the path's output.  (The requirement for the
+  * pathkeys is unnecessary, since the testing can return at most one tuple
+  * for any particular set of scan tuples of those tables, but let's be safe.)
   */
  typedef struct ForeignPath
  {
  	Path		path;
+ 	Path	   *subpath;
  	List	   *fdw_private;
  } ForeignPath;
  
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
***************
*** 83,88 **** extern ForeignPath *create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 83,89 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *subpath,
  						List *fdw_private);
  
  extern Relids calc_nestloop_required_outer(Path *outer_path, Path *inner_path);
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
***************
*** 45,51 **** extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
  				  Index scanrelid, Plan *subplan);
  extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
  				 Index scanrelid, List *fdw_exprs, List *fdw_private,
! 				 List *fdw_scan_tlist);
  extern Append *make_append(List *appendplans, List *tlist);
  extern RecursiveUnion *make_recursive_union(List *tlist,
  					 Plan *lefttree, Plan *righttree, int wtParam,
--- 45,51 ----
  				  Index scanrelid, Plan *subplan);
  extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
  				 Index scanrelid, List *fdw_exprs, List *fdw_private,
! 				 List *fdw_scan_tlist, List *fdw_quals);
  extern Append *make_append(List *appendplans, List *tlist);
  extern RecursiveUnion *make_recursive_union(List *tlist,
  					 Plan *lefttree, Plan *righttree, int wtParam,

#52

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#50)

Re: Foreign join pushdown vs EvalPlanQual

Hello, sorry in advance if this brings up past discussions again
or turns out to be pointless.

I'm attaching an updated version of the patch. The patch is based on
the SS_finalize_plan patch that has been recently committed. I'd be
happy if this helps people discuss more about how to fix this issue.

The two patches make a good contrast that helps clarify the problem,
at least for me.

code in ExecInitForeignScan, ExecForeignScan, ExecEndForeignScan, and
ExecReScanForeignScan. I think that would resolve the name problem
also.

I found two points in this discussion.

1. Where (or when) to initialize a foreign/custom scan node for
recheck.

If a new list to hold substitute plans were added to the planner
globals (and PlannedStmt), EvalPlanQualStart() looks to be
the best place to initialize them.

Of course, that cannot be a solution if the new member and
related code are unacceptable or unreasonable. The remaining
candidates in that case would be ExecInitNode() (as in
v2.0) or the FDW routines (as in v1.0).

2. How the core informs FDW/custom scan handlers whether it is
in a recheck.

In the v1.0 patch, the nodeForeignscan.c routines detect the situation
using es_epqTuple and Scan.scanrelid, which the core already
provides, whereas v2.0 implicitly (and maybe irregularly) replaces
the scan node itself at initialization. The latter
doesn't look tidy to me.

I think refining v1.0 would be more desirable, and resolving
the redundancy would be simply a matter of notation.

If I understand correctly, the Exec*ForeignScan() routines other than
ExecInitForeignScan() can determine their behavior simply by
looking at outerPlanState(scanstate) (if we continue to use the
lefttree member for that purpose). Is that right? And does it
eliminate the redundancy?

ExecEndForeignScan(ForeignScanState *node)
{
    if ((outerPlan = outerPlanState(node)) != NULL)
        ExecEndNode(outerPlan);
...
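
The two ideas in point 2 can be sketched with mock types. This is only an illustration under stated assumptions, not PostgreSQL code: the Mock* structs, in_epq_recheck() and mock_end_foreign_scan() are hypothetical stand-ins. Only the condition "es_epqTuple != NULL and scanrelid == 0" and the idea of keying later behavior on the outer plan state come from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for EState. */
typedef struct MockEState
{
    void *es_epqTuple;          /* non-NULL while an EvalPlanQual recheck runs */
} MockEState;

/* Hypothetical stand-in for the state of a local join plan. */
typedef struct MockPlanState
{
    bool ended;                 /* set once the node has been shut down */
} MockPlanState;

/* Hypothetical stand-in for ForeignScanState. */
typedef struct MockForeignScanState
{
    unsigned int scanrelid;     /* 0 means the scan represents a remote join */
    MockPlanState *outer;       /* local join plan state, or NULL if none */
    bool ended;
} MockForeignScanState;

/* v1.0-style detection: a join-type foreign scan during an EPQ recheck. */
static bool
in_epq_recheck(const MockEState *estate, const MockForeignScanState *node)
{
    return estate->es_epqTuple != NULL && node->scanrelid == 0;
}

/*
 * Shutdown that looks only at the outer plan state: if a local join
 * plan was initialized, end it first, then end the foreign scan itself.
 */
static void
mock_end_foreign_scan(MockForeignScanState *node)
{
    assert(node != NULL);
    if (node->outer != NULL)
        node->outer->ended = true;
    node->ended = true;
}
```

In this sketch, only initialization would need in_epq_recheck(); the other routines could check whether node->outer is set, which is the simplification asked about above.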

regards,

At Fri, 04 Sep 2015 19:50:46 +0900, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote in <55E97786.30404@lab.ntt.co.jp>

On 2015/09/03 19:25, Etsuro Fujita wrote:

On 2015/09/03 14:22, Etsuro Fujita wrote:

On 2015/09/03 9:41, Robert Haas wrote:

That having been said, I don't entirely like Fujita-san's patch
either. Much of the new code is called immediately adjacent to an FDW
callback which could pretty trivially do the same thing itself, if
needed.

...

I gave it another thought; the following changes to ExecInitNode would
make the patch much simpler, ie, we would no longer need to call the
new
code in ExecInitForeignScan, ExecForeignScan, ExecEndForeignScan, and
ExecReScanForeignScan. I think that would resolve the name problem
also.

I'm attaching an updated version of the patch. The patch is based on
the SS_finalize_plan patch that has been recently committed. I'd be
happy if this helps people discuss more about how to fix this issue.

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Tom Lane (#40)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On 2015/09/02 23:30, Tom Lane wrote:

Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> writes:

On 2015/08/01 23:25, Tom Lane wrote:

In my vision you probably really only want one call per build_join_rel
event (that is, per construction of a new RelOptInfo), not per
make_join_rel event.

I'd vote for hooking at standard_join_search.

I think that method would require the FDW to duplicate a whole lot of the
join search mechanism, for not a whole lot of benefit. It's possible that
there'd be value in doing some initial reconnaissance once we've examined
all the baserels, so I'm not necessarily against providing a hook there.
But if you have in mind that typical FDWs would actually create join paths
at that point, consider that

1. The FDW would have to find all the combinations of its supplied
relations (unless you are only intending to generate one path for the
union of all such rels, which seems pretty narrow-minded from here).

2. The FDW would have to account for join_is_legal considerations.

3. The FDW would have to arrange for creation of joinrel RelOptInfo
structures. While that's possible, the available infrastructure for it
assumes that joinrels are built up from pairs of simpler joinrels, so
you couldn't go directly to the union of all the FDW's rels anyway.

Maybe my explanation was not clear, but the hook placement I have in
mind is just before the set_cheapest call for each joinrel in
standard_join_search, as you proposed in [1]. And I think that if that
joinrel contains only foreign tables that all belong to the same foreign
server, then we give the FDW a chance to consider producing pushed-down
joins for that joinrel, ie, remote joins for all the foreign tables
contained in that joinrel. So, there is no need for #2 and #3. Also, I
think that would allow us, in principle, to consider producing
pushed-down joins for all the legal combinations of foreign tables that
belong to the same foreign server, according to the dynamic-programming
method. I don't have a solution to the how-to-build-the-query-text
problem yet, though.
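
The eligibility test described above (call the FDW only for a joinrel whose members are all foreign tables on one server) could look roughly like the following. This is a minimal sketch with mock types, not planner code: MockBaseRel, INVALID_SERVER, joinrel_common_server() and should_call_fdw_join_hook() are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define INVALID_SERVER 0u       /* hypothetical marker for "not a foreign table" */

/* Hypothetical stand-in for a base relation contained in a joinrel. */
typedef struct MockBaseRel
{
    unsigned int serverid;      /* INVALID_SERVER for a local table */
} MockBaseRel;

/*
 * Return the common server id if every member is a foreign table on the
 * same server; otherwise return INVALID_SERVER.
 */
static unsigned int
joinrel_common_server(const MockBaseRel *rels, size_t nrels)
{
    unsigned int serverid;
    size_t i;

    assert(rels != NULL || nrels == 0);
    if (nrels == 0)
        return INVALID_SERVER;

    serverid = rels[0].serverid;
    for (i = 0; i < nrels; i++)
    {
        if (rels[i].serverid == INVALID_SERVER ||
            rels[i].serverid != serverid)
            return INVALID_SERVER;
    }
    return serverid;
}

/* The FDW would be offered the joinrel only when this returns true. */
static bool
should_call_fdw_join_hook(const MockBaseRel *rels, size_t nrels)
{
    return nrels >= 2 &&
        joinrel_common_server(rels, nrels) != INVALID_SERVER;
}
```

Because the check would run once per joinrel built by the dynamic-programming search, join legality and RelOptInfo construction stay with the core planner in this sketch.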

Best regards,
Etsuro Fujita

[1]: /messages/by-id/5451.1426271510@sss.pgh.pa.us


#54

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#48)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On 2015/09/04 0:33, Robert Haas wrote:

I'm worried that trawling through that
SpecialJoinInfo data will end up needing to duplicate much of
make_join_rel and add_paths_to_joinrel. For example, consider:

SELECT * FROM verysmall v JOIN (bigft1 FULL JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r;

The best path for this plan is presumably something like this:

Nested Loop
  -> Seq Scan on verysmall v
  -> Foreign Scan on bigft1 and bigft2
       Remote SQL: SELECT * FROM bigft1 FULL JOIN bigft2 ON bigft1.x =
       bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Now, how is the FDW going to figure out that it needs to generate this
parameterized path without duplicating this code from
add_paths_to_joinrel?

/*
* Decide whether it's sensible to generate parameterized paths for this
* joinrel, and if so, which relations such paths should require. There
* is usually no need to create a parameterized result path unless there
...

Maybe there's a very simple answer to this question and I'm just not
seeing it, but I really don't see how that's going to work.

Why don't you look at the "regular" (local join execution) paths that
were already generated? I think that if we called the FDW at a proper
hook location, the FDW could probably find a regular path in
rel->pathlist of the join rel (bigft1, bigft2) that would generate
something like:

Nested Loop
  -> Seq Scan on verysmall v
  -> Nested Loop
       Join Filter: (bigft1.a = bigft2.a)
       -> Foreign Scan on bigft1
            Remote SQL: SELECT * FROM bigft1 WHERE bigft1.q = $1
       -> Foreign Scan on bigft2
            Remote SQL: SELECT * FROM bigft2 WHERE bigft2.r = $2

From the parameterization of the regular nestloop path for joining
bigft1 and bigft2 locally, I think that the FDW could find that it's
sensible to generate the foreign-join path for (bigft1, bigft2) with the
parameterization.
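
As a rough illustration of that idea, the FDW could walk the paths that survive in the joinrel's pathlist and collect the distinct parameterizations they carry. The sketch below uses mock types only: MockRelids (a plain bitmask), MockPath and collect_parameterizations() are hypothetical and are not the planner's Relids, Path or any existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Relids: bit N set means relation N is required. */
typedef uint32_t MockRelids;

/* Hypothetical stand-in for a path in a joinrel's pathlist. */
typedef struct MockPath
{
    MockRelids required_outer;  /* 0 for an unparameterized path */
} MockPath;

/*
 * Collect the distinct non-empty parameterizations found among the
 * given paths.  Writes at most max_out entries to out and returns the
 * number written.
 */
static size_t
collect_parameterizations(const MockPath *pathlist, size_t npaths,
                          MockRelids *out, size_t max_out)
{
    size_t nout = 0;
    size_t i;
    size_t j;

    assert(out != NULL || max_out == 0);
    for (i = 0; i < npaths && nout < max_out; i++)
    {
        MockRelids req = pathlist[i].required_outer;
        int seen = 0;

        if (req == 0)
            continue;           /* unparameterized, nothing to record */
        for (j = 0; j < nout; j++)
        {
            if (out[j] == req)
            {
                seen = 1;
                break;
            }
        }
        if (!seen)
            out[nout++] = req;
    }
    return nout;
}
```

In the example query, the local nestloop over (bigft1, bigft2) would carry a parameterization naming verysmall, so a foreign-join path with that same parameterization would be a candidate. The sketch can only see paths that are still present in the list.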

Best regards,
Etsuro Fujita


#55

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#54)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On Tue, Sep 8, 2015 at 5:35 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

On 2015/09/04 0:33, Robert Haas wrote:

I'm worried that trawling through that
SpecialJoinInfo data will end up needing to duplicate much of
make_join_rel and add_paths_to_joinrel. For example, consider:

SELECT * FROM verysmall v JOIN (bigft1 FULL JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r;

The best path for this plan is presumably something like this:

Nested Loop
  -> Seq Scan on verysmall v
  -> Foreign Scan on bigft1 and bigft2
       Remote SQL: SELECT * FROM bigft1 FULL JOIN bigft2 ON bigft1.x =
       bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Now, how is the FDW going to figure out that it needs to generate this
parameterized path without duplicating this code from
add_paths_to_joinrel?

/*
 * Decide whether it's sensible to generate parameterized paths for this
 * joinrel, and if so, which relations such paths should require. There
 * is usually no need to create a parameterized result path unless there
...

Maybe there's a very simple answer to this question and I'm just not
seeing it, but I really don't see how that's going to work.

Why don't you look at the "regular" (local join execution) paths that were
already generated. I think that if we called the FDW at a proper hook
location, the FDW could probably find a regular path in rel->pathlist of the
join rel (bigft1, bigft2) that possibly generates something like:

Nested Loop
  -> Seq Scan on verysmall v
  -> Nested Loop
       Join Filter: (bigft1.a = bigft2.a)
       -> Foreign Scan on bigft1
            Remote SQL: SELECT * FROM bigft1 WHERE bigft1.q = $1
       -> Foreign Scan on bigft2
            Remote SQL: SELECT * FROM bigft2 WHERE bigft2.r = $2

From the parameterization of the regular nestloop path for joining bigft1
and bigft2 locally, I think that the FDW could find that it's sensible to
generate the foreign-join path for (bigft1, bigft2) with the
parameterization.

But that path might have already been discarded on the basis of cost.
I think Tom's idea is better: let the FDW consult some state cached
for this purpose in the RelOptInfo.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#56

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Tom Lane (#49)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On Thu, Sep 3, 2015 at 11:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Sep 2, 2015 at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Offhand I think that the
most likely way to build that text will be to examine the query's jointree
to see where c,d,e appear in it. But in any case, that's a separate issue
and I fail to see how plopping the join search problem into the FDW's lap
would make it any easier.

Yeah, I am not advocating for putting the hook in
standard_join_search. I'm explaining why I put it in
add_paths_to_joinrel instead of, as I believe you were advocating, in
make_join_rel prior to the big switch.

If you had a solution to the how-to-build-the-query-text problem,
and it depended on that hook placement, then your argument might
make some sense. As is, you've entirely failed to convince me
that this placement is not wrong, wasteful, and likely to create
unnecessary API breaks for FDWs.

(Also, per my last message on the subject, *after* the switch
is what I think makes sense.)

After re-reading a few emails, I've realized that I've let myself get
a bit confused here and have unwittingly switched sides in this
argument. <puts brown paper bag over head>

When we originally discussed this back in April, I was arguing for
either make_join_rel() or add_paths_to_joinrel() and you were arguing
for standard_join_search(). See here:

/messages/by-id/CA+TgmobOADxTbsCt-j+dDVefWGK1WxY4p8AVDp1Pz48_TX4XTA@mail.gmail.com

I thought we were still having the same argument, but we're not.
You're now arguing for make_one_rel(), which back then was perfectly
acceptable to me, and now that I've gotten my thinking un-fuzzed,
really still is, except for the question posed in the closing
paragraph of that email, which is (mostly) whether clients like
postgres_fdw are going to need extra_lateral_rels in order to do the
right thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#57

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#55)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On 2015/09/09 3:53, Robert Haas wrote:

On Tue, Sep 8, 2015 at 5:35 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

On 2015/09/04 0:33, Robert Haas wrote:

I'm worried that trawling through that
SpecialJoinInfo data will end up needing to duplicate much of
make_join_rel and add_paths_to_joinrel. For example, consider:

SELECT * FROM verysmall v JOIN (bigft1 FULL JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r;

The best path for this plan is presumably something like this:

Nested Loop
  -> Seq Scan on verysmall v
  -> Foreign Scan on bigft1 and bigft2
       Remote SQL: SELECT * FROM bigft1 FULL JOIN bigft2 ON bigft1.x =
       bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Now, how is the FDW going to figure out that it needs to generate this
parameterized path without duplicating this code from
add_paths_to_joinrel?

/*
 * Decide whether it's sensible to generate parameterized paths for this
 * joinrel, and if so, which relations such paths should require.  There
 * is usually no need to create a parameterized result path unless there
...

Maybe there's a very simple answer to this question and I'm just not
seeing it, but I really don't see how that's going to work.

Why don't you look at the "regular" (local join execution) paths that were
already generated? I think that if we called the FDW at a proper hook
location, the FDW could probably find a regular path in rel->pathlist of the
join rel (bigft1, bigft2) that possibly generates something like:

Nested Loop
  -> Seq Scan on verysmall v
  -> Nested Loop
       Join Filter: (bigft1.x = bigft2.x)
       -> Foreign Scan on bigft1
            Remote SQL: SELECT * FROM bigft1 WHERE bigft1.q = $1
       -> Foreign Scan on bigft2
            Remote SQL: SELECT * FROM bigft2 WHERE bigft2.r = $2

From the parameterization of the regular nestloop path for joining bigft1
and bigft2 locally, I think that the FDW could find that it's sensible to
generate the foreign-join path for (bigft1, bigft2) with the
parameterization.

But that path might have already been discarded on the basis of cost.
I think Tom's idea is better: let the FDW consult some state cached
for this purpose in the RelOptInfo.

Do you have an idea of what information would be collected into the
state, and how the FDW would derive from that information the
parameterizations to consider when producing pushed-down joins? What I'm
concerned about is reducing the number of parameterizations to
consider, so as to reduce the overhead of costing the corresponding
queries. I may be missing something, though.

Best regards,
Etsuro Fujita


#58

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#46)

Re: Foreign join pushdown vs EvalPlanQual

On Thu, Sep 3, 2015 at 6:25 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I gave it another thought; the following changes to ExecInitNode would make
the patch much simpler, ie, we would no longer need to call the new code in
ExecInitForeignScan, ExecForeignScan, ExecEndForeignScan, and
ExecReScanForeignScan. I think that would resolve the name problem also.

*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 247,254 **** ExecInitNode(Plan *node, EState *estate, int eflags)
              break;
          case T_ForeignScan:
!             result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
!                                                        estate, eflags);
              break;
          case T_CustomScan:
--- 247,269 ----
              break;
          case T_ForeignScan:
!             {
!                 Index       scanrelid = ((ForeignScan *) node)->scan.scanrelid;
!
!                 if (estate->es_epqTuple != NULL && scanrelid == 0)
!                 {
!                     /*
!                      * We are in foreign join inside an EvalPlanQual recheck.
!                      * Initialize local join execution plan, instead.
!                      */
!                     Plan       *subplan = ((ForeignScan *) node)->fs_subplan;
!
!                     result = ExecInitNode(subplan, estate, eflags);
!                 }
!                 else
!                     result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
!                                                                estate, eflags);
!             }
              break;

I don't think that's a good idea. The Plan tree and the PlanState
tree should be mirror images of each other; breaking that equivalence
will cause confusion, at least.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#59

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#45)

Re: Foreign join pushdown vs EvalPlanQual

On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I'm wondering if there's another approach. If I understand correctly,
there are two reasons why the current situation is untenable. The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there. The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.

My question about this is, is the callback really needed? If there are any
FDWs that want to do the work *in their own way*, instead of just doing
ExecProcNode for executing a local join execution plan in case of foreign
join (or just doing ExecQual for checking remote quals in case of foreign
table), I'd agree with introducing the callback, but if not, I don't think
that that makes much sense.

It doesn't seem to me that it hurts much of anything to add the
callback there, and it does provide some flexibility. Actually, I'm
not really sure why we're thinking we need a subplan here at all,
rather than just having a ForeignRecheck callback that can do whatever
it needs to do with no particular help from the core infrastructure.
I think you wrote some code to show how postgres_fdw would use the API
you are proposing, but I can't find it. Can you point me in the right
direction?
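
To make the optional-callback idea concrete, here is a minimal sketch in
plain C. It is not code from any patch in this thread: ScanStateSketch,
FdwRoutineSketch, TupleSketch, foreign_recheck, and example_recheck are all
hypothetical stand-ins for the real executor structures. It only shows the
shape of the proposal: call the FDW's routine if one was supplied, and
otherwise keep accepting the tuple as before, so existing FDWs need no
change.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real PostgreSQL structs. */
typedef struct TupleSketch
{
    int a;
    int b;
} TupleSketch;

struct ScanStateSketch;
typedef bool (*RecheckCallback) (struct ScanStateSketch *node,
                                 TupleSketch *slot);

typedef struct FdwRoutineSketch
{
    RecheckCallback recheck;    /* optional; NULL for existing FDWs */
} FdwRoutineSketch;

typedef struct ScanStateSketch
{
    const FdwRoutineSketch *fdwroutine;
} ScanStateSketch;

/*
 * Core-side recheck: defer to the FDW if it supplied a callback,
 * otherwise keep the old behavior of always accepting the tuple.
 */
bool
foreign_recheck(ScanStateSketch *node, TupleSketch *slot)
{
    if (node->fdwroutine != NULL && node->fdwroutine->recheck != NULL)
        return node->fdwroutine->recheck(node, slot);
    return true;
}

/* Invented FDW callback: re-evaluate a pushed-down qual (a = b) locally. */
bool
example_recheck(ScanStateSketch *node, TupleSketch *slot)
{
    (void) node;                /* unused in this sketch */
    return slot->a == slot->b;
}
```

With a NULL callback every test tuple passes, which matches what
ForeignRecheck does today; with example_recheck installed, a tuple whose
columns differ is rejected.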

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#60

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#57)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On Wed, Sep 9, 2015 at 2:30 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

But that path might have already been discarded on the basis of cost.
I think Tom's idea is better: let the FDW consult some state cached
for this purpose in the RelOptInfo.

Do you have an idea of what information would be collected into the state,
and how the FDW would derive from that information the parameterizations to
consider when producing pushed-down joins? What I'm concerned about is
reducing the number of parameterizations to consider, so as to reduce the
overhead of costing the corresponding queries. I may be missing something, though.

I think the thing we'd want to store in the state would be enough
information to reconstruct a valid join nest. For example, the
reloptinfo for (A B) might note that A needs to be left-joined to B.
When we go to construct paths for (A B C), and there is no
SpecialJoinInfo that mentions C, we know that we can construct (A LJ
B) IJ C rather than (A IJ B) IJ C. If any paths survived, we could
find a way to pull that information out of the path, but pulling it
out of the RelOptInfo should always work.
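
As a rough illustration of what "enough information to reconstruct a valid
join nest" could look like, here is a small self-contained C sketch. The
JoinNest type and render_nest function are hypothetical and exist only for
this example; ON clauses are left out, and nothing here is an existing
planner structure. It simply records, per join level, which join type
connects the two inputs, and renders the nest as text.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical per-rel state; NOT an existing planner structure. */
typedef enum { NEST_LEAF, NEST_INNER, NEST_LEFT } NestKind;

typedef struct JoinNest
{
    NestKind    kind;
    const char *relname;            /* set for leaves only */
    const struct JoinNest *left;    /* set for joins only */
    const struct JoinNest *right;   /* set for joins only */
} JoinNest;

/* Append s to the NUL-terminated string in buf without overflowing it. */
static void
append(char *buf, size_t len, const char *s)
{
    size_t used = strlen(buf);

    if (used + 1 < len)
        snprintf(buf + used, len - used, "%s", s);
}

/*
 * Render the join nest as parenthesized text.  The caller must pass a
 * buffer that already holds a NUL-terminated (possibly empty) string.
 */
void
render_nest(const JoinNest *nest, char *buf, size_t len)
{
    if (nest->kind == NEST_LEAF)
    {
        append(buf, len, nest->relname);
        return;
    }
    append(buf, len, "(");
    render_nest(nest->left, buf, len);
    append(buf, len, nest->kind == NEST_LEFT ? " LEFT JOIN " : " INNER JOIN ");
    render_nest(nest->right, buf, len);
    append(buf, len, ")");
}
```

If the state for (A B) says "A left-joined to B" and C is then added with an
inner join, rendering the three-way nest gives
((a LEFT JOIN b) INNER JOIN c), independent of which paths survived.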

I am not sure what to do about parameterizations. That's one of my
remaining concerns about moving the hook.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#61

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#59)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/11 6:24, Robert Haas wrote:

On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I'm wondering if there's another approach. If I understand correctly,
there are two reasons why the current situation is untenable. The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there. The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.

My question about this is, is the callback really needed? If there are any
FDWs that want to do the work *in their own way*, instead of just doing
ExecProcNode for executing a local join execution plan in case of foreign
join (or just doing ExecQual for checking remote quals in case of foreign
table), I'd agree with introducing the callback, but if not, I don't think
that that makes much sense.

It doesn't seem to me that it hurts much of anything to add the
callback there, and it does provide some flexibility. Actually, I'm
not really sure why we're thinking we need a subplan here at all,
rather than just having a ForeignRecheck callback that can do whatever
it needs to do with no particular help from the core infrastructure.
I think you wrote some code to show how postgres_fdw would use the API
you are proposing, but I can't find it. Can you point me in the right
direction?

I've proposed the following API changes:

* I modified create_foreignscan_path, which is called from
postgresGetForeignJoinPaths/postgresGetForeignPaths, so that a path,
subpath, is passed as the eighth argument of the function. subpath
represents a local join execution path if scanrelid==0, but NULL if
scanrelid>0.

* I modified make_foreignscan, which is called from
postgresGetForeignPlan, so that a list of quals, fdw_quals, is passed as
the last argument of the function. fdw_quals represents remote quals if
scanrelid>0, but NIL if scanrelid==0.

You can find that code in the postgres_fdw patch
(foreign_join_v16_efujita.patch) attached to [1].

Best regards,
Etsuro Fujita

[1]: /messages/by-id/55CB2D45.7040100@lab.ntt.co.jp


#62

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#59)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

At Thu, 10 Sep 2015 17:24:00 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobxksR2=3wEdY5cEgpd1hQ6Z0WoZEBBoxgs=XKZpbfUXA@mail.gmail.com>

On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I'm wondering if there's another approach. If I understand correctly,
there are two reasons why the current situation is untenable. The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there. The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.

My question about this is, is the callback really needed? If there are any
FDWs that want to do the work *in their own way*, instead of just doing
ExecProcNode for executing a local join execution plan in case of foreign
join (or just doing ExecQual for checking remote quals in case of foreign
table), I'd agree with introducing the callback, but if not, I don't think
that that makes much sense.

It doesn't seem to me that it hurts much of anything to add the
callback there, and it does provide some flexibility. Actually, I'm
not really sure why we're thinking we need a subplan here at all,
rather than just having a ForeignRecheck callback that can do whatever
it needs to do with no particular help from the core infrastructure.
I think you wrote some code to show how postgres_fdw would use the API
you are proposing, but I can't find it. Can you point me in the right
direction?

I've heard that the reason for the (fs_)subplan is that it should
be initialized using create_plan_recurse, set_plan_refs and
finalize_plan (or others), which are static functions in the
planner, unavailable in fdw code.

Is this pointless?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#63

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#62)

Re: Foreign join pushdown vs EvalPlanQual

Sorry, that's quite wrong.. Please let me fix it.

- Is this pointless?
+ Does it make sense?

=====
Hello,

At Thu, 10 Sep 2015 17:24:00 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobxksR2=3wEdY5cEgpd1hQ6Z0WoZEBBoxgs=XKZpbfUXA@mail.gmail.com>

On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I'm wondering if there's another approach. If I understand correctly,
there are two reasons why the current situation is untenable. The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there. The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.

My question about this is, is the callback really needed? If there are any
FDWs that want to do the work *in their own way*, instead of just doing
ExecProcNode for executing a local join execution plan in case of foreign
join (or just doing ExecQual for checking remote quals in case of foreign
table), I'd agree with introducing the callback, but if not, I don't think
that that makes much sense.

It doesn't seem to me that it hurts much of anything to add the
callback there, and it does provide some flexibility. Actually, I'm
not really sure why we're thinking we need a subplan here at all,
rather than just having a ForeignRecheck callback that can do whatever
it needs to do with no particular help from the core infrastructure.
I think you wrote some code to show how postgres_fdw would use the API
you are proposing, but I can't find it. Can you point me in the right
direction?

Does it make sense?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#64

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#62)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

-----Original Message-----
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Sent: Friday, September 11, 2015 2:05 PM
To: robertmhaas@gmail.com
Cc: fujita.etsuro@lab.ntt.co.jp; Kaigai Kouhei(海外浩平);
pgsql-hackers@postgresql.org; shigeru.hanada@gmail.com
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

Hello,

At Thu, 10 Sep 2015 17:24:00 -0400, Robert Haas <robertmhaas@gmail.com> wrote
in <CA+TgmobxksR2=3wEdY5cEgpd1hQ6Z0WoZEBBoxgs=XKZpbfUXA@mail.gmail.com>

On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I'm wondering if there's another approach. If I understand correctly,
there are two reasons why the current situation is untenable. The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there. The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.

My question about this is, is the callback really needed? If there are any
FDWs that want to do the work *in their own way*, instead of just doing
ExecProcNode for executing a local join execution plan in case of foreign
join (or just doing ExecQual for checking remote quals in case of foreign
table), I'd agree with introducing the callback, but if not, I don't think
that that makes much sense.

It doesn't seem to me that it hurts much of anything to add the
callback there, and it does provide some flexibility. Actually, I'm
not really sure why we're thinking we need a subplan here at all,
rather than just having a ForeignRecheck callback that can do whatever
it needs to do with no particular help from the core infrastructure.
I think you wrote some code to show how postgres_fdw would use the API
you are proposing, but I can't find it. Can you point me in the right
direction?

I've heard that the reason for the (fs_)subplan is that it should
be initialized using create_plan_recurse, set_plan_refs and
finalize_plan (or others), which are static functions in the
planner, unavailable in fdw code.

This was discussed when the custom-scan/join interface was merged. I had
originally designed the interface so that the extension would call
create_plan_recurse() itself; however, we concluded that the function
should stay static, and that the extension should instead tell the core
which path nodes are to be initialized.
This also reduced the complexity of the interface, because we could omit
the callbacks that would otherwise have been placed in setrefs.c and
subselect.c.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#65

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#61)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: Etsuro Fujita [mailto:fujita.etsuro@lab.ntt.co.jp]
Sent: Friday, September 11, 2015 12:36 PM
To: Robert Haas
Cc: Kaigai Kouhei(海外浩平); PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/09/11 6:24, Robert Haas wrote:

On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I'm wondering if there's another approach. If I understand correctly,
there are two reasons why the current situation is untenable. The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there. The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.

My question about this is, is the callback really needed? If there are any
FDWs that want to do the work *in their own way*, instead of just doing
ExecProcNode for executing a local join execution plan in case of foreign
join (or just doing ExecQual for checking remote quals in case of foreign
table), I'd agree with introducing the callback, but if not, I don't think
that that makes much sense.

It doesn't seem to me that it hurts much of anything to add the
callback there, and it does provide some flexibility. Actually, I'm
not really sure why we're thinking we need a subplan here at all,
rather than just having a ForeignRecheck callback that can do whatever
it needs to do with no particular help from the core infrastructure.
I think you wrote some code to show how postgres_fdw would use the API
you are proposing, but I can't find it. Can you point me in the right
direction?

I've proposed the following API changes:

* I modified create_foreignscan_path, which is called from
postgresGetForeignJoinPaths/postgresGetForeignPaths, so that a path,
subpath, is passed as the eighth argument of the function. subpath
represents a local join execution path if scanrelid==0, but NULL if
scanrelid>0.

I'd like to suggest allowing multiple path nodes, as custom-scan does, because
that infrastructure would also be helpful for implementing an FDW driver that
has multiple sub-plans. One expected usage is here:
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F20AD@BPXM15GP.gisp.nec.co.jp

* I modified make_foreignscan, which is called from
postgresGetForeignPlan, so that a list of quals, fdw_quals, is passed as
the last argument of the function. fdw_quals represents remote quals if
scanrelid>0, but NIL if scanrelid==0.

If a callback in ForeignRecheck processes the EPQ rechecks, the PostgreSQL
core doesn't need to know which expression was pushed down or how it is kept
in the private field (fdw_exprs). Only the FDW driver knows which private
field holds the expression node that was pushed down to the remote side. That
should not be part of the interface contract.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#66

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#58)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/11 6:02, Robert Haas wrote:

On Thu, Sep 3, 2015 at 6:25 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I gave it another thought; the following changes to ExecInitNode would make
the patch much simpler, ie, we would no longer need to call the new code in
ExecInitForeignScan, ExecForeignScan, ExecEndForeignScan, and
ExecReScanForeignScan. I think that would resolve the name problem also.

*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 247,254 **** ExecInitNode(Plan *node, EState *estate, int eflags)
              break;
          case T_ForeignScan:
!             result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
!                                                        estate, eflags);
              break;
          case T_CustomScan:
--- 247,269 ----
              break;
          case T_ForeignScan:
!             {
!                 Index       scanrelid = ((ForeignScan *) node)->scan.scanrelid;
!
!                 if (estate->es_epqTuple != NULL && scanrelid == 0)
!                 {
!                     /*
!                      * We are in foreign join inside an EvalPlanQual recheck.
!                      * Initialize local join execution plan, instead.
!                      */
!                     Plan       *subplan = ((ForeignScan *) node)->fs_subplan;
!
!                     result = ExecInitNode(subplan, estate, eflags);
!                 }
!                 else
!                     result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
!                                                                estate, eflags);
!             }
              break;

I don't think that's a good idea. The Plan tree and the PlanState
tree should be mirror images of each other; breaking that equivalence
will cause confusion, at least.

IIRC, Horiguchi-san also pointed that out. Honestly, I also think that
that is weird, but IIUC, I think it can't hurt. What I was concerned
about was EXPLAIN, but EXPLAIN doesn't handle an EvalPlanQual PlanState
tree at least currently.

Best regards,
Etsuro Fujita


#67

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#60)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On 2015/09/11 6:30, Robert Haas wrote:

On Wed, Sep 9, 2015 at 2:30 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

But that path might have already been discarded on the basis of cost.
I think Tom's idea is better: let the FDW consult some state cached
for this purpose in the RelOptInfo.

Do you have an idea of what information would be collected into the state,
and how the FDW would derive from that information the parameterizations to
consider when producing pushed-down joins? What I'm concerned about is
reducing the number of parameterizations to consider, so as to reduce the
overhead of costing the corresponding queries. I may be missing something, though.

I think the thing we'd want to store in the state would be enough
information to reconstruct a valid join nest. For example, the
reloptinfo for (A B) might note that A needs to be left-joined to B.
When we go to construct paths for (A B C), and there is no
SpecialJoinInfo that mentions C, we know that we can construct (A LJ
B) IJ C rather than (A IJ B) IJ C. If any paths survived, we could
find a way to pull that information out of the path, but pulling it
out of the RelOptInfo should always work.

So, information to address the how-to-build-the-query-text
problem would be stored in the state, in other words. Right?

I am not sure what to do about parameterizations. That's one of my
remaining concerns about moving the hook.

I think we should also make it clear what to do about sort orderings.

Best regards,
Etsuro Fujita


#68

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#61)

Re: Foreign join pushdown vs EvalPlanQual

On Thu, Sep 10, 2015 at 11:36 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I've proposed the following API changes:

* I modified create_foreignscan_path, which is called from
postgresGetForeignJoinPaths/postgresGetForeignPaths, so that a path,
subpath, is passed as the eighth argument of the function. subpath
represents a local join execution path if scanrelid==0, but NULL if
scanrelid>0.

OK, I see now. But I don't much like the way
get_unsorted_unparameterized_path() looks.

First, it's basically praying that MergePath, HashPath, and NestPath
can be flat-copied without breaking anything. In general, we have
copyfuncs.c support for nodes that we need to be able to copy, and we
use copyObject() to do it. Even if what you've got here works today,
it's not very future-proof.

Second, what guarantee do we have that we'll find a path with no
pathkeys and a NULL param_info? Why can't all of the paths for a join
relation have pathkeys? Why can't they all be parameterized? I can't
think of anything that would guarantee that.

Third, even if such a guarantee existed, why is this the right
behavior? Any join type will produce the same output; it's just a
question of performance. And if you have only one tuple on each side,
surely a nested loop would be fine.

It seems to me that what you ought to be doing is using data hung off
the fdw_private field of each RelOptInfo to cache a NestPath that can
be used for EPQ rechecks at that level. When you go to consider
pushing down another join, you can build up a new NestPath that's
suitable for the new level. That seems much cleaner than groveling
through the list of surviving paths and hoping you find the right kind
of thing.
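
A minimal sketch of that caching scheme, in plain C with stand-in types, might
look like the following. EpqPathSketch and RelSketch are hypothetical and are
not the real NestPath or RelOptInfo; the only thing borrowed from the
discussion is the idea that each rel's fdw_private carries a ready-made
recheck path, and that a pushed-down join builds its own recheck path from
the ones cached on its two inputs.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; NOT the real NestPath / RelOptInfo. */
typedef struct EpqPathSketch
{
    int relid;                          /* base relation id; 0 for a join */
    const struct EpqPathSketch *outer;  /* NULL for a base-relation scan */
    const struct EpqPathSketch *inner;  /* NULL for a base-relation scan */
} EpqPathSketch;

typedef struct RelSketch
{
    void *fdw_private;      /* used here to cache the rel's recheck path */
} RelSketch;

EpqPathSketch *
make_epq_path(int relid, const EpqPathSketch *outer,
              const EpqPathSketch *inner)
{
    EpqPathSketch *path = malloc(sizeof(EpqPathSketch));

    if (path == NULL)
        return NULL;
    path->relid = relid;
    path->outer = outer;
    path->inner = inner;
    return path;
}

/* Base relation: cache a plain scan as its recheck path. */
void
cache_scan_path(RelSketch *baserel, int relid)
{
    baserel->fdw_private = make_epq_path(relid, NULL, NULL);
}

/* Pushed-down join: stack a nested loop on the inputs' cached paths. */
void
cache_join_path(RelSketch *joinrel, const RelSketch *outerrel,
                const RelSketch *innerrel)
{
    joinrel->fdw_private = make_epq_path(0, outerrel->fdw_private,
                                         innerrel->fdw_private);
}

/* Count the base relations reachable from a cached recheck path. */
int
count_base_rels(const EpqPathSketch *path)
{
    if (path == NULL)
        return 0;
    if (path->outer == NULL && path->inner == NULL)
        return 1;
    return count_base_rels(path->outer) + count_base_rels(path->inner);
}
```

Each new join level only touches the two cached children, so nothing depends
on which paths happen to remain in the rel's pathlist.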

And all that having been said, I still don't really understand why you
are resisting the idea of providing a callback so that the FDW can
execute arbitrary code in the recheck path. There doesn't seem to be
any reason not to let the FDW take control of the rechecks if it
wishes, and there's no real cost in complexity that I can see.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#69

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Kouhei Kaigai (#65)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Sep 11, 2015 at 2:01 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

If a callback in ForeignRecheck processes the EPQ rechecks, the PostgreSQL
core doesn't need to know which expression was pushed down or how it is kept
in the private field (fdw_exprs). Only the FDW driver knows which private
field holds the expression node that was pushed down to the remote side. That
should not be part of the interface contract.

I agree. It seems needless to involve the core code here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#70

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#66)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Sep 11, 2015 at 3:08 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

IIRC, Horiguchi-san also pointed that out. Honestly, I also think that that
is weird, but IIUC, I think it can't hurt. What I was concerned about was
EXPLAIN, but EXPLAIN doesn't handle an EvalPlanQual PlanState tree at least
currently.

This has come up a few times before and some people have argued for
changing the coding rule. Nevertheless, for now, it is the rule.
IMHO, it's a pretty good rule that makes things easier to understand
and reason about. If there's an argument for changing it, it's
performance, not developer convenience. Anyway, we should try to fix
this problem without getting tangled in that argument.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#71

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#67)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On Fri, Sep 11, 2015 at 3:12 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

So, information to address the how-to-build-the-query-text
problem would be stored in the state, in other words. Right?

Right.

I am not sure what to do about parameterizations. That's one of my
remaining concerns about moving the hook.

I think we should also make it clear what to do about sort orderings.

How does that come into it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#72

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Robert Haas (#68)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Sent: Saturday, September 12, 2015 1:39 AM
To: Etsuro Fujita
Cc: Kaigai Kouhei(海外浩平); PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On Thu, Sep 10, 2015 at 11:36 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I've proposed the following API changes:

* I modified create_foreignscan_path, which is called from
postgresGetForeignJoinPaths/postgresGetForeignPaths, so that a path,
subpath, is passed as the eighth argument of the function. subpath
represents a local join execution path if scanrelid==0, but NULL if
scanrelid>0.

OK, I see now. But I don't much like the way
get_unsorted_unparameterized_path() looks.

First, it's basically praying that MergePath, HashPath, and NestPath
can be flat-copied without breaking anything. In general, we have
copyfuncs.c support for nodes that we need to be able to copy, and we
use copyObject() to do it. Even if what you've got here works today,
it's not very future-proof.

Second, what guarantee do we have that we'll find a path with no
pathkeys and a NULL param_info? Why can't all of the paths for a join
relation have pathkeys? Why can't they all be parameterized? I can't
think of anything that would guarantee that.

Third, even if such a guarantee existed, why is this the right
behavior? Any join type will produce the same output; it's just a
question of performance. And if you have only one tuple on each side,
surely a nested loop would be fine.

It seems to me that what you ought to be doing is using data hung off
the fdw_private field of each RelOptInfo to cache a NestPath that can
be used for EPQ rechecks at that level. When you go to consider
pushing down another join, you can build up a new NestPath that's
suitable for the new level. That seems much cleaner than groveling
through the list of surviving paths and hoping you find the right kind
of thing.

And all that having been said, I still don't really understand why you
are resisting the idea of providing a callback so that the FDW can
execute arbitrary code in the recheck path. There doesn't seem to be
any reason not to let the FDW take control of the rechecks if it
wishes, and there's no real cost in complexity that I can see.

The discussion has been pending for two weeks, even though we put this
problem on the open item towards v9.5; that means we recognize it is
a problem to be fixed by the v9.5 release.

The attached patch allows an FDW driver to handle the EPQ recheck in
whatever way it prefers, whether that is an alternative local join or
ExecQual on the expression being pushed down.

Regarding the alternative join path selection, I initially thought it
would be valuable to choose the best path from a performance
standpoint. However, what we need to do here is a visibility check
against the EPQ tuples already loaded into the EState, so an
unparameterized NestLoop is sufficient to evaluate the qualifiers
across relations.
(What happens if HashJoin is chosen? It's probably problematic.)

So, if your modified postgres_fdw keeps an alternative path, all we
need to do is construct a dummy NestPath with no param_info, no
pathkeys, and a dummy cost, and attach this path to fdw_paths of the
ForeignPath. It will be transformed into plan nodes, and eventually
into plan-state nodes, by postgres_fdw itself.
I cannot see anything difficult in that any more.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Attachments:

pgsql-fdw-epq-recheck.v2.patch (application/octet-stream)

 doc/src/sgml/fdwhandler.sgml            | 24 +++++++++++++++++++++++-
 src/backend/commands/explain.c          | 23 +++++++++++++++++++++++
 src/backend/executor/execScan.c         | 12 ++++++++++--
 src/backend/executor/nodeForeignscan.c  | 15 +++++++++++++++
 src/backend/nodes/copyfuncs.c           |  1 +
 src/backend/nodes/nodeFuncs.c           |  7 +++++++
 src/backend/nodes/outfuncs.c            |  2 ++
 src/backend/nodes/readfuncs.c           |  1 +
 src/backend/optimizer/plan/createplan.c | 13 ++++++++++++-
 src/backend/optimizer/plan/setrefs.c    |  8 ++++++++
 src/backend/optimizer/plan/subselect.c  | 24 ++++++++++++++++++++----
 src/include/foreign/fdwapi.h            |  7 ++++++-
 src/include/nodes/execnodes.h           |  1 +
 src/include/nodes/plannodes.h           |  1 +
 src/include/nodes/relation.h            |  1 +
 15 files changed, 131 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 1dac7ad..317c21c 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -168,7 +168,8 @@ GetForeignPlan (PlannerInfo *root,
                 Oid foreigntableid,
                 ForeignPath *best_path,
                 List *tlist,
-                List *scan_clauses);
+                List *scan_clauses,
+                List *fdw_plans);
 </programlisting>
 
      Create a <structname>ForeignScan</> plan node from the selected foreign
@@ -259,6 +260,27 @@ IterateForeignScan (ForeignScanState *node);
 
     <para>
 <programlisting>
+bool
+RecheckForeignScan (ForeignScanState *node, TupleTableSlot *slot);
+</programlisting>
+     Rechecks visibility of the EPQ tuples according to the latest status.
+     Once row-level update or lock contention get detected, EPQ mechanism
+     reloads the target rows using <function>RefetchForeignRow</>, then
+     tries to recheck whether the latest row is still visible.
+    </para>
+    <para>
+     When <structname>ForeignScanState</> represents base relation,
+     the supplied <literal>slot</> is expected to have the latest row
+     of the target relation.
+     Elsewhere, if <literal>scanrelid</> equals zero thus it represents
+     multiple joined relations, the callback is expected to fill up the
+     supplied <literal>slot</> according to the <structfield>fdw_scan_tlist</>
+     definition. It should know which EPQ tuples are the source of its
+     result tuple.
+    </para>
+
+    <para>
+<programlisting>
 void
 ReScanForeignScan (ForeignScanState *node);
 </programlisting>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f0d9e94..1504069 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -114,6 +114,8 @@ static void ExplainMemberNodes(List *plans, PlanState **planstates,
 				   List *ancestors, ExplainState *es);
 static void ExplainSubPlans(List *plans, List *ancestors,
 				const char *relationship, ExplainState *es);
+static void ExplainForeignChildren(ForeignScanState *fss,
+								   List *ancestors, ExplainState *es);
 static void ExplainCustomChildren(CustomScanState *css,
 					  List *ancestors, ExplainState *es);
 static void ExplainProperty(const char *qlabel, const char *value,
@@ -1528,6 +1530,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		IsA(plan, BitmapAnd) ||
 		IsA(plan, BitmapOr) ||
 		IsA(plan, SubqueryScan) ||
+		(IsA(planstate, ForeignScanState) &&
+		 ((ForeignScanState *) planstate)->fdw_ps != NIL) ||
 		(IsA(planstate, CustomScanState) &&
 		 ((CustomScanState *) planstate)->custom_ps != NIL) ||
 		planstate->subPlan;
@@ -1584,6 +1588,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			ExplainNode(((SubqueryScanState *) planstate)->subplan, ancestors,
 						"Subquery", NULL, es);
 			break;
+		case T_ForeignScan:
+			ExplainForeignChildren((ForeignScanState *) planstate,
+								   ancestors, es);
+			break;
 		case T_CustomScan:
 			ExplainCustomChildren((CustomScanState *) planstate,
 								  ancestors, es);
@@ -2624,6 +2632,21 @@ ExplainSubPlans(List *plans, List *ancestors,
 }
 
 /*
+ * Explain a list of children of a ForeignScan.
+ */
+static void
+ExplainForeignChildren(ForeignScanState *fss,
+					   List *ancestors, ExplainState *es)
+{
+	ListCell   *cell;
+	const char *label =
+		(list_length(fss->fdw_ps) != 1 ? "children" : "child");
+
+	foreach(cell, fss->fdw_ps)
+		ExplainNode((PlanState *) lfirst(cell), ancestors, label, NULL, es);
+}
+
+/*
  * Explain a list of children of a CustomScan.
  */
 static void
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index a96e826..c88da92 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -49,8 +49,16 @@ ExecScanFetch(ScanState *node,
 		 */
 		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;
 
-		Assert(scanrelid > 0);
-		if (estate->es_epqTupleSet[scanrelid - 1])
+		if (scanrelid == 0)
+		{
+			TupleTableSlot *slot = node->ss_ScanTupleSlot;
+
+			/* Check if it meets the access-method conditions */
+			if (!(*recheckMtd) (node, slot))
+				ExecClearTuple(slot);	/* would not be returned by scan */
+			return slot;
+		}
+		else if (estate->es_epqTupleSet[scanrelid - 1])
 		{
 			TupleTableSlot *slot = node->ss_ScanTupleSlot;
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index bb28a73..6c3d920 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -72,6 +72,21 @@ ForeignNext(ForeignScanState *node)
 static bool
 ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
 {
+	FdwRoutine	   *fdwroutine = node->fdwroutine;
+
+	/*
+	 * This FDW callback have two tasks. (1) If this ForeignScanState
+	 * represents an external join (thus scanrelid==0), it need to
+	 * construct a tuple according to TupleDesc of the slot; that is
+	 * initialized according to the fdw_scan_tlist. (2) If this node
+	 * has any qualifiers not to be executed locally, it has to apply
+	 * visibility checks by the qualifier (because ExecQual on ExecScan
+	 * runs towards node->scan.plan.qual, not on the qualifier pushed-
+	 * down).
+	 */
+	if (fdwroutine->RecheckForeignScan)
+		return fdwroutine->RecheckForeignScan(node, slot);
+
 	/* There are no access-method-specific conditions to recheck. */
 	return true;
 }
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 62355aa..989833e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -622,6 +622,7 @@ _copyForeignScan(const ForeignScan *from)
 	 * copy remainder of node
 	 */
 	COPY_SCALAR_FIELD(fs_server);
+	COPY_NODE_FIELD(fdw_plans);
 	COPY_NODE_FIELD(fdw_exprs);
 	COPY_NODE_FIELD(fdw_private);
 	COPY_NODE_FIELD(fdw_scan_tlist);
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index a11cb9f..99e03a9 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -3485,6 +3485,13 @@ planstate_tree_walker(PlanState *planstate, bool (*walker) (), void *context)
 			if (walker(((SubqueryScanState *) planstate)->subplan, context))
 				return true;
 			break;
+		case T_ForeignScan:
+			foreach (lc, ((ForeignScanState *) planstate)->fdw_ps)
+			{
+				if (walker((PlanState *) lfirst(lc), context))
+					return true;
+			}
+			break;
 		case T_CustomScan:
 			foreach (lc, ((CustomScanState *) planstate)->custom_ps)
 			{
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c91273c..438d1d4 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -579,6 +579,7 @@ _outForeignScan(StringInfo str, const ForeignScan *node)
 	_outScanInfo(str, (const Scan *) node);
 
 	WRITE_OID_FIELD(fs_server);
+	WRITE_NODE_FIELD(fdw_plans);
 	WRITE_NODE_FIELD(fdw_exprs);
 	WRITE_NODE_FIELD(fdw_private);
 	WRITE_NODE_FIELD(fdw_scan_tlist);
@@ -1667,6 +1668,7 @@ _outForeignPath(StringInfo str, const ForeignPath *node)
 
 	_outPathInfo(str, (const Path *) node);
 
+	WRITE_NODE_FIELD(fdw_paths);
 	WRITE_NODE_FIELD(fdw_private);
 }
 
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 08519ed..5b274da 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1795,6 +1795,7 @@ _readForeignScan(void)
 	ReadCommonScan(&local_node->scan);
 
 	READ_OID_FIELD(fs_server);
+	READ_NODE_FIELD(fdw_plans);
 	READ_NODE_FIELD(fdw_exprs);
 	READ_NODE_FIELD(fdw_private);
 	READ_NODE_FIELD(fdw_scan_tlist);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 404c6f5..a915cb6 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2059,11 +2059,20 @@ create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
 	Index		scan_relid = rel->relid;
 	Oid			rel_oid = InvalidOid;
 	Bitmapset  *attrs_used = NULL;
+	List	   *fdw_plans = NIL;
 	ListCell   *lc;
 	int			i;
 
 	Assert(rel->fdwroutine != NULL);
 
+	/* Recursively transform child paths. */
+	foreach (lc, best_path->fdw_paths)
+	{
+		Plan   *plan = create_plan_recurse(root, (Path *) lfirst(lc));
+
+		fdw_plans = lappend(fdw_plans, plan);
+	}
+
 	/*
 	 * If we're scanning a base relation, fetch its OID.  (Irrelevant if
 	 * scanning a join relation.)
@@ -2093,7 +2102,9 @@ create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
 	 */
 	scan_plan = rel->fdwroutine->GetForeignPlan(root, rel, rel_oid,
 												best_path,
-												tlist, scan_clauses);
+												tlist,
+												scan_clauses,
+												fdw_plans);
 
 	/* Copy cost data from Path to Plan; no need to make FDW do this */
 	copy_path_costsize(&scan_plan->scan.plan, &best_path->path);
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index daeb584..b4972cd 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1093,6 +1093,8 @@ set_foreignscan_references(PlannerInfo *root,
 						   ForeignScan *fscan,
 						   int rtoffset)
 {
+	ListCell   *lc;
+
 	/* Adjust scanrelid if it's valid */
 	if (fscan->scan.scanrelid > 0)
 		fscan->scan.scanrelid += rtoffset;
@@ -1136,6 +1138,12 @@ set_foreignscan_references(PlannerInfo *root,
 			fix_scan_list(root, fscan->fdw_exprs, rtoffset);
 	}
 
+	/* Adjust child plan-nodes recursively, if needed */
+	foreach (lc, fscan->fdw_plans)
+	{
+		lfirst(lc) = set_plan_refs(root, (Plan *) lfirst(lc), rtoffset);
+	}
+
 	/* Adjust fs_relids if needed */
 	if (rtoffset > 0)
 	{
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index d0bc412..cdc8cde 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2394,10 +2394,26 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_ForeignScan:
-			finalize_primnode((Node *) ((ForeignScan *) plan)->fdw_exprs,
-							  &context);
-			/* We assume fdw_scan_tlist cannot contain Params */
-			context.paramids = bms_add_members(context.paramids, scan_params);
+			{
+				ForeignScan	   *fscan = (ForeignScan *) plan;
+				ListCell	   *lc;
+
+				finalize_primnode((Node *) fscan->fdw_exprs, &context);
+				/* We assume fdw_scan_tlist cannot contain Params */
+				context.paramids =
+					bms_add_members(context.paramids, scan_params);
+
+				/* child nodes if any */
+				foreach (lc, fscan->fdw_plans)
+				{
+					context.paramids =
+						bms_add_members(context.paramids,
+										finalize_plan(root,
+													  (Plan *) lfirst(lc),
+													  valid_params,
+													  scan_params));
+				}
+			}
 			break;
 
 		case T_CustomScan:
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 69b48b4..4a41351 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -36,13 +36,17 @@ typedef ForeignScan *(*GetForeignPlan_function) (PlannerInfo *root,
 														  Oid foreigntableid,
 													  ForeignPath *best_path,
 															 List *tlist,
-														 List *scan_clauses);
+												 List *scan_clauses,
+												 List *fdw_plans);
 
 typedef void (*BeginForeignScan_function) (ForeignScanState *node,
 													   int eflags);
 
 typedef TupleTableSlot *(*IterateForeignScan_function) (ForeignScanState *node);
 
+typedef bool (*RecheckForeignScan_function) (ForeignScanState *node,
+											 TupleTableSlot *slot);
+
 typedef void (*ReScanForeignScan_function) (ForeignScanState *node);
 
 typedef void (*EndForeignScan_function) (ForeignScanState *node);
@@ -138,6 +142,7 @@ typedef struct FdwRoutine
 	GetForeignPlan_function GetForeignPlan;
 	BeginForeignScan_function BeginForeignScan;
 	IterateForeignScan_function IterateForeignScan;
+	RecheckForeignScan_function RecheckForeignScan;
 	ReScanForeignScan_function ReScanForeignScan;
 	EndForeignScan_function EndForeignScan;
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4ae2f3e..fdab372 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1581,6 +1581,7 @@ typedef struct ForeignScanState
 	ScanState	ss;				/* its first field is NodeTag */
 	/* use struct pointer to avoid including fdwapi.h here */
 	struct FdwRoutine *fdwroutine;
+	List	   *fdw_ps;			/* list of child PlanState nodes, if any */
 	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index cc259f1..e253678 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -520,6 +520,7 @@ typedef struct ForeignScan
 {
 	Scan		scan;
 	Oid			fs_server;		/* OID of foreign server */
+	List	   *fdw_plans;		/* list of Plan nodes, if any */
 	List	   *fdw_exprs;		/* expressions that FDW may evaluate */
 	List	   *fdw_private;	/* private data for FDW */
 	List	   *fdw_scan_tlist; /* optional tlist describing scan tuple */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 79bed33..809bf60 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -905,6 +905,7 @@ typedef struct TidPath
 typedef struct ForeignPath
 {
 	Path		path;
+	List	   *fdw_paths;
 	List	   *fdw_private;
 } ForeignPath;

#73

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Kouhei Kaigai (#72)

Re: Foreign join pushdown vs EvalPlanQual

On Mon, Sep 28, 2015 at 3:34 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The attached patch allows an FDW driver to handle the EPQ recheck in
whatever way it prefers, whether that is an alternative local join or
ExecQual on the expression being pushed down.

Thanks. I was all set to commit this, or at least part of it, when I
noticed that we already have an FDW callback called RefetchForeignRow.
We seem to be intending that this new callback should refetch the row
from the foreign server and verify that any pushed-down quals apply to
it. But why can't RefetchForeignRow do that? That seems to be what
it's for.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#74

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Robert Haas (#73)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Sent: Tuesday, September 29, 2015 5:46 AM
To: Kaigai Kouhei(海外浩平)
Cc: Etsuro Fujita; PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On Mon, Sep 28, 2015 at 3:34 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The attached patch allows an FDW driver to handle the EPQ recheck in
whatever way it prefers, whether that is an alternative local join or
ExecQual on the expression being pushed down.

Thanks. I was all set to commit this, or at least part of it, when I
noticed that we already have an FDW callback called RefetchForeignRow.
We seem to be intending that this new callback should refetch the row
from the foreign server and verify that any pushed-down quals apply to
it. But why can't RefetchForeignRow do that? That seems to be what
it's for.

There are at least two obstacles to solving the problem with
RefetchForeignRow.

1. RefetchForeignRow() does not take a ForeignScanState argument, so it
is not obvious how it would cooperate with the private state in
ForeignScanState, which may include the pushed-down expressions and
so on.

2. A ForeignScan with scanrelid == 0 represents the result of joined
relations. Even if the refetched tuple is visible at the base-relation
level, it may not survive the join condition at the upper level.
Once a join between relations is pushed down, only the FDW driver
knows how the base relations are joined.

So the only reasonable way is to ask the FDW driver, in ExecScanFetch,
to check the visibility of a particular tuple or of another tuple built
from it.
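
To make point 2 concrete, here is a small self-contained sketch in plain C. It is a toy model under stated assumptions: `ToyTuple`, `ToyJoinState`, `ToyJoinSlot`, and `toy_recheck_join` are hypothetical names invented for this example, and the pushed-down join clause is assumed to be `foo.a = bar.a`. The callback plays the role that a recheck callback would have for a scanrelid == 0 node: it looks at the refetched rows of both relations, re-applies the join clause that only the FDW knows about, and fills the result slot only when the pair still joins.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy row with a single integer column "a". */
typedef struct ToyTuple
{
	int			a;
} ToyTuple;

/* Toy private state: the latest (EPQ) row of each joined relation. */
typedef struct ToyJoinState
{
	const ToyTuple *epq_foo;	/* refetched row of "foo", or NULL */
	const ToyTuple *epq_bar;	/* refetched row of "bar", or NULL */
} ToyJoinState;

/* Toy result slot for the joined row. */
typedef struct ToyJoinSlot
{
	bool		empty;
	int			foo_a;
	int			bar_a;
} ToyJoinSlot;

/*
 * Returns true and fills the slot if the refetched rows still satisfy
 * the (assumed) pushed-down join clause foo.a = bar.a.
 */
bool
toy_recheck_join(const ToyJoinState *state, ToyJoinSlot *slot)
{
	slot->empty = true;

	/* Either side missing: there is no joined row to return. */
	if (state->epq_foo == NULL || state->epq_bar == NULL)
		return false;

	/* Each row may be visible on its own and still fail the join. */
	if (state->epq_foo->a != state->epq_bar->a)
		return false;

	slot->foo_a = state->epq_foo->a;
	slot->bar_a = state->epq_bar->a;
	slot->empty = false;
	return true;
}
```

A per-relation refetch has no way to express the second check, because neither row is wrong in isolation; only the combination is.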

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#75

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#74)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/29 9:13, Kouhei Kaigai wrote:

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Sent: Tuesday, September 29, 2015 5:46 AM
To: Kaigai Kouhei(海外浩平)
Cc: Etsuro Fujita; PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On Mon, Sep 28, 2015 at 3:34 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The attached patch allows an FDW driver to handle the EPQ recheck in
whatever way it prefers, whether that is an alternative local join or
ExecQual on the expression being pushed down.

Thanks for the work, KaiGai-san!

Thanks. I was all set to commit this, or at least part of it, when I
noticed that we already have an FDW callback called RefetchForeignRow.
We seem to be intending that this new callback should refetch the row
from the foreign server and verify that any pushed-down quals apply to
it. But why can't RefetchForeignRow do that? That seems to be what
it's for.

Thanks for the comments, Robert!

I thought the same thing [1]. While I thought it was relatively easy to
make changes to RefetchForeignRow that way for the foreign table case
(scanrelid>0), I was not sure how hard it would be to do so for the
foreign join case (scanrelid==0). So, I proposed to leave those changes
for 9.6. I'll have a rethink on this issue along the lines of that
approach.

Sorry for having had no response. I was on vacation.

Best regards,
Etsuro Fujita

[1]: /messages/by-id/55DEB5A9.8010604@lab.ntt.co.jp

#76

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#75)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita
Sent: Tuesday, September 29, 2015 12:15 PM
To: Kaigai Kouhei(海外浩平); Robert Haas
Cc: PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/09/29 9:13, Kouhei Kaigai wrote:

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Sent: Tuesday, September 29, 2015 5:46 AM
To: Kaigai Kouhei(海外浩平)
Cc: Etsuro Fujita; PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On Mon, Sep 28, 2015 at 3:34 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The attached patch allows an FDW driver to handle the EPQ recheck in
whatever way it prefers, whether that is an alternative local join or
ExecQual on the expression being pushed down.

Thanks for the work, KaiGai-san!

Thanks. I was all set to commit this, or at least part of it, when I
noticed that we already have an FDW callback called RefetchForeignRow.
We seem to be intending that this new callback should refetch the row
from the foreign server and verify that any pushed-down quals apply to
it. But why can't RefetchForeignRow do that? That seems to be what
it's for.

Thanks for the comments, Robert!

I thought the same thing [1]. While I thought it was relatively easy to
make changes to RefetchForeignRow that way for the foreign table case
(scanrelid>0), I was not sure how hard it would be to do so for the
foreign join case (scanrelid==0). So, I proposed to leave those changes
for 9.6. I'll have a rethink on this issue along the lines of that
approach.

Even in the base relation case, is it really easy to do?

RefetchForeignRow() does not take a ForeignScanState as its argument,
so it is not obvious how to access its private field, is it?
ExecRowMark contains an "rti" field, so it might be feasible to find
the target PlanState using the recently added walker routine, although
that is not simple enough.
Unless we have a reference to the private field, it is not feasible
to access the expression that was pushed down to the remote side, and
therefore we cannot apply proper rechecks here.

In addition, it is problematic when scanrelid==0 because we have no
relevant ForeignScanState that represents the base relations, even
though ExecRowMark is associated with a particular base relation.
In the scanrelid==0 case, the EPQ recheck routine also has to ensure
that the EPQ tuple is visible with respect to the join condition, in
addition to the qualifiers of the base relation. This information is
also stored in the private data field, so the routine needs a reference
to the private data of the ForeignScanState of the remote join
(scanrelid==0) that contains the target relation.

Could you explain (1) how to access the private data field of
ForeignScanState from the RefetchForeignRow callback, and (2) why that
is more reasonable to implement than the callback in ForeignRecheck()?

Sorry for having had no response. I was on vacation.

Me too. :-)

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#77

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#76)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/29 13:55, Kouhei Kaigai wrote:

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita
On 2015/09/29 9:13, Kouhei Kaigai wrote:

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
On Mon, Sep 28, 2015 at 3:34 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The attached patch allows an FDW driver to handle the EPQ recheck in
whatever way it prefers, whether that is an alternative local join or
ExecQual on the expression being pushed down.

Thanks. I was all set to commit this, or at least part of it, when I
noticed that we already have an FDW callback called RefetchForeignRow.
We seem to be intending that this new callback should refetch the row
from the foreign server and verify that any pushed-down quals apply to
it. But why can't RefetchForeignRow do that? That seems to be what
it's for.

I thought the same thing [1]. While I thought it was relatively easy to
make changes to RefetchForeignRow that way for the foreign table case
(scanrelid>0), I was not sure how hard it would be to do so for the
foreign join case (scanrelid==0). So, I proposed to leave those changes
for 9.6. I'll have a rethink on this issue along the lines of that
approach.

Even in the base relation case, is it really easy to do?

RefetchForeignRow() does not take a ForeignScanState as its argument,
so it is not obvious how to access its private field, is it?
ExecRowMark contains an "rti" field, so it might be feasible to find
the target PlanState using the recently added walker routine, although
that is not simple enough.
Unless we have a reference to the private field, it is not feasible
to access the expression that was pushed down to the remote side, and
therefore we cannot apply proper rechecks here.

In addition, it is problematic when scanrelid==0 because we have no
relevant ForeignScanState that represents the base relations, even
though ExecRowMark is associated with a particular base relation.
In the scanrelid==0 case, the EPQ recheck routine also has to ensure
that the EPQ tuple is visible with respect to the join condition, in
addition to the qualifiers of the base relation. This information is
also stored in the private data field, so the routine needs a reference
to the private data of the ForeignScanState of the remote join
(scanrelid==0) that contains the target relation.

Could you explain (1) how to access the private data field of
ForeignScanState from the RefetchForeignRow callback, and (2) why that
is more reasonable to implement than the callback in ForeignRecheck()?

For the foreign table case (scanrelid>0), I imagined an approach
different from yours. In that case, I thought the issue could probably
be addressed by just modifying the remote query performed in
RefetchForeignRow, which would be of the form "SELECT ctid, * FROM
remote table WHERE ctid = $1", so that the modified query would be of
the form "SELECT ctid, * FROM remote table WHERE ctid = $1 AND *remote
quals*".
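
The query rewrite described above can be sketched in a few lines of plain C. This is only an illustration: `toy_build_refetch_sql` is a hypothetical helper invented here, not a postgres_fdw function, and it assumes the remote quals are already available as deparsed SQL text and that the relation name needs no quoting.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the refetch query into buf.  When remote_quals is non-empty it is
 * ANDed onto the ctid lookup, so a row that no longer satisfies the
 * pushed-down conditions is simply not returned.
 *
 * Returns 0 on success, -1 if the query does not fit into buf.
 * (Hypothetical helper; identifier quoting is deliberately ignored.)
 */
int
toy_build_refetch_sql(char *buf, size_t buflen,
					  const char *relname, const char *remote_quals)
{
	int			n;

	if (remote_quals != NULL && remote_quals[0] != '\0')
		n = snprintf(buf, buflen,
					 "SELECT ctid, * FROM %s WHERE ctid = $1 AND (%s)",
					 relname, remote_quals);
	else
		n = snprintf(buf, buflen,
					 "SELECT ctid, * FROM %s WHERE ctid = $1",
					 relname);

	return (n >= 0 && (size_t) n < buflen) ? 0 : -1;
}
```

The parentheses around the quals keep an `OR` inside them from changing the meaning of the `ctid = $1` condition.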

For the foreign join case (scanrelid==0), I think we would need some
changes not only to RefetchForeignRow but also to the existing
EvalPlanQual machinery in the core. I don't have a clear picture yet,
though.

Best regards,
Etsuro Fujita

#78

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#77)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita
Sent: Tuesday, September 29, 2015 4:36 PM
To: Kaigai Kouhei(海外浩平); Robert Haas
Cc: PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/09/29 13:55, Kouhei Kaigai wrote:

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita
On 2015/09/29 9:13, Kouhei Kaigai wrote:

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
On Mon, Sep 28, 2015 at 3:34 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The attached patch allows an FDW driver to handle the EPQ recheck in
whatever way it prefers, whether that is an alternative local join or
ExecQual on the expression being pushed down.

Thanks. I was all set to commit this, or at least part of it, when I
noticed that we already have an FDW callback called RefetchForeignRow.
We seem to be intending that this new callback should refetch the row
from the foreign server and verify that any pushed-down quals apply to
it. But why can't RefetchForeignRow do that? That seems to be what
it's for.

I thought the same thing [1]. While I thought it was relatively easy to
make changes to RefetchForeignRow that way for the foreign table case
(scanrelid>0), I was not sure how hard it would be to do so for the
foreign join case (scanrelid==0). So, I proposed to leave those changes
for 9.6. I'll have a rethink on this issue along the lines of that
approach.

Even in the base relation case, is it really easy to do?

RefetchForeignRow() does not take a ForeignScanState as its argument,
so it is not obvious how to access its private field, is it?
ExecRowMark contains an "rti" field, so it might be feasible to find
the target PlanState using the recently added walker routine, although
that is not simple enough.
Unless we have a reference to the private field, it is not feasible
to access the expression that was pushed down to the remote side, and
therefore we cannot apply proper rechecks here.

In addition, it is problematic when scanrelid==0 because we have no
relevant ForeignScanState that represents the base relations, even
though ExecRowMark is associated with a particular base relation.
In the scanrelid==0 case, the EPQ recheck routine also has to ensure
that the EPQ tuple is visible with respect to the join condition, in
addition to the qualifiers of the base relation. This information is
also stored in the private data field, so the routine needs a reference
to the private data of the ForeignScanState of the remote join
(scanrelid==0) that contains the target relation.

Could you explain (1) how to access the private data field of
ForeignScanState from the RefetchForeignRow callback, and (2) why that
is more reasonable to implement than the callback in ForeignRecheck()?

For the foreign table case (scanrelid>0), I imagined an approach
different from yours. In that case, I thought the issue could probably
be addressed by just modifying the remote query performed in
RefetchForeignRow, which would be of the form "SELECT ctid, * FROM
remote table WHERE ctid = $1", so that the modified query would be of
the form "SELECT ctid, * FROM remote table WHERE ctid = $1 AND *remote
quals*".

My question is how to obtain the expression of the remote query.
It would be stored somewhere in a private field of ForeignScanState;
however, RefetchForeignRow does not have direct access to the
relevant ForeignScanState node.
That is what I asked in question (1).

Also note that EvalPlanQualFetchRowMarks() will raise an error
if the RefetchForeignRow callback returns a NULL tuple.
Is that the right or expected behavior?
It looks to me as though this callback is designed to pull out a
particular tuple identified by the remote row-id, regardless of whether
the latest values pass the qualifier checks.

For the foreign join case (scanrelid==0), I think we would need some
changes not only to RefetchForeignRow but also to the existing
EvalPlanQual machinery in the core. I don't have a clear picture yet,
though.

If people agree that FDW remote join is an incomplete feature in v9.5,
the attached fix-up is the minimum requirement from the standpoint
of custom-scan/join.
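
The control flow that such a fix-up gives the EPQ branch of the scan fetch can be modeled in isolation roughly as follows. This is a simplified toy in plain C, not executor code: `ToySlot`, `ToyScanNode`, `ToyEState`, and every `toy_*` function are hypothetical names invented for this sketch, and the ordinary non-EPQ scan path is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct ToySlot
{
	bool		empty;
	int			value;
} ToySlot;

typedef struct ToyScanNode
{
	unsigned	scanrelid;	/* 0 models a node standing for a join */
	ToySlot		slot;
} ToyScanNode;

typedef struct ToyEState
{
	const bool *epq_tuple_set;	/* indexed by scanrelid - 1 */
	const int  *epq_tuple;		/* indexed by scanrelid - 1 */
} ToyEState;

typedef bool (*ToyRecheckMtd) (ToyScanNode *node, ToySlot *slot);

ToySlot *
toy_scan_fetch_epq(ToyScanNode *node, const ToyEState *estate,
				   ToyRecheckMtd recheck)
{
	ToySlot    *slot = &node->slot;

	slot->empty = true;
	if (node->scanrelid == 0)
	{
		/* No single EPQ tuple applies; the callback must fill the slot. */
		if (!recheck(node, slot))
			slot->empty = true;		/* ensure an empty slot is returned */
		return slot;
	}
	if (estate->epq_tuple_set[node->scanrelid - 1])
	{
		/* Base relation: load the test tuple, then let it be rechecked. */
		slot->value = estate->epq_tuple[node->scanrelid - 1];
		slot->empty = false;
		if (!recheck(node, slot))
			slot->empty = true;
		return slot;
	}
	return NULL;				/* toy model: ordinary scan path omitted */
}

/* Sample callbacks for the toy model. */
bool
toy_recheck_build_join(ToyScanNode *node, ToySlot *slot)
{
	(void) node;
	slot->value = 42;			/* pretend a joined row was constructed */
	slot->empty = false;
	return true;
}

bool
toy_recheck_accept(ToyScanNode *node, ToySlot *slot)
{
	(void) node;
	(void) slot;
	return true;
}

bool
toy_recheck_reject(ToyScanNode *node, ToySlot *slot)
{
	(void) node;
	(void) slot;
	return false;
}
```

The point of the scanrelid == 0 branch is that the core code never indexes the per-relation EPQ arrays with `scanrelid - 1` for a join node; everything is delegated to the callback.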

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Attachments:

pgsql-fdw-epq-recheck.v3.patch (application/octet-stream)

 src/backend/executor/execScan.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index a96e826..89c75ca 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -49,8 +49,19 @@ ExecScanFetch(ScanState *node,
 		 */
 		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;
 
-		Assert(scanrelid > 0);
-		if (estate->es_epqTupleSet[scanrelid - 1])
+		if (scanrelid == 0)
+		{
+			TupleTableSlot	   *slot = ExecClearTuple(node->ss_ScanTupleSlot);
+
+			/* Only ForeignScan or CustomScan can have scanrelid==0 */
+			Assert(IsA(node, ForeignScanState) ||
+				   IsA(node, CustomScanState));
+			/* Check if it meets the access-method conditions */
+			if (!(*recheckMtd) (node, slot))
+				ExecClearTuple(slot);	/* ensure an empty slot is returned */
+			return slot;
+		}
+		else if (estate->es_epqTupleSet[scanrelid - 1])
 		{
 			TupleTableSlot *slot = node->ss_ScanTupleSlot;

#79

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#78)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/29 17:49, Kouhei Kaigai wrote:

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita

RefetchForeignRow() does not take ForeignScanState as its argument,
so it is not obvious to access its private field, isn't it?
ExecRowMark contains "rti" field, so it might be feasible to find out
the target PlanState using walker routine recently supported, although
it is not a simple enough.
Unless we don't have reference to the private field, it is not feasible
to access expression that was pushed down to the remote-side, therefore,
it does not allow to apply proper rechecks here.

Could you introduce us (1) how to access private data field of
ForeignScanState from the RefetchForeignRow callback?

For the foreign table case (scanrelid>0), I imagined an approach
different than yours. In that case, I thought the issue would be
probably addressed by just modifying the remote query performed in
RefetchForeignRow, which would be of the form "SELECT ctid, * FROM
remote table WHERE ctid = $1", so that the modified query would be of
the form "SELECT ctid, * FROM remote table WHERE ctid = $1 AND *remote
quals*".

Sorry, I forgot to add "FOR UPDATE" to the before/after queries.

My question is how to pull expression of the remote query.
It shall be stored at somewhere private field of ForeignScanState,
however, RefetchForeignRow does not have direct access to the
relevant ForeignScanState node.
It is what I asked at the question (1).

I imagined the following steps to get the remote query string: (1)
create the remote query string and store it in fdw_private during
postgresGetForeignPlan, (2) extract the string from fdw_private and
store it in erm->ermExtra during postgresBeginForeignScan, and (3)
extract the string from erm->ermExtra in postgresRefetchForeignRow.
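
As a rough illustration of the query rewrite described above, the sketch
below builds the re-fetch statement with and without remote quals
appended. The helper name build_refetch_query and its parameters are
hypothetical and are not postgres_fdw's actual deparse code; only the
query shape (including the "FOR UPDATE" suffix) comes from this message.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of postgres_fdw): build the statement a
 * RefetchForeignRow implementation could send to the remote server.
 * remote_quals may be NULL or empty when no quals were pushed down.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int
build_refetch_query(char *buf, size_t buflen,
                    const char *remote_table, const char *remote_quals)
{
    int n;

    if (remote_quals != NULL && remote_quals[0] != '\0')
        n = snprintf(buf, buflen,
                     "SELECT ctid, * FROM %s WHERE ctid = $1 AND (%s) FOR UPDATE",
                     remote_table, remote_quals);
    else
        n = snprintf(buf, buflen,
                     "SELECT ctid, * FROM %s WHERE ctid = $1 FOR UPDATE",
                     remote_table);

    /* snprintf reports the length it wanted; treat truncation as failure */
    return (n < 0 || (size_t) n >= buflen) ? -1 : n;
}
```

In the three-step scheme above, a string like this would be produced once
at plan time and carried through fdw_private and erm->ermExtra rather than
rebuilt on every refetch.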

Also note that EvalPlanQualFetchRowMarks() will raise an error
if RefetchForeignRow callback returned NULL tuple.
Is it right or expected behavior?

IIUC, I think that that behavior is reasonable.

It looks to me this callback is designed to pull out a particular
tuple identified by the remote-row-id, regardless of the qualifier
checks based on the latest value.

Because erm->markType==ROW_MARK_REFERENCE, I don't think that that
behavior would cause any problem. Maybe I'm missing something, though.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#80

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#79)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita
Sent: Tuesday, September 29, 2015 8:00 PM
To: Kaigai Kouhei(海外浩平); Robert Haas
Cc: PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/09/29 17:49, Kouhei Kaigai wrote:

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita

RefetchForeignRow() does not take ForeignScanState as its argument,
so it is not obvious to access its private field, isn't it?
ExecRowMark contains "rti" field, so it might be feasible to find out
the target PlanState using walker routine recently supported, although
it is not a simple enough.
Unless we don't have reference to the private field, it is not feasible
to access expression that was pushed down to the remote-side, therefore,
it does not allow to apply proper rechecks here.

Could you introduce us (1) how to access private data field of
ForeignScanState from the RefetchForeignRow callback?

For the foreign table case (scanrelid>0), I imagined an approach
different than yours. In that case, I thought the issue would be
probably addressed by just modifying the remote query performed in
RefetchForeignRow, which would be of the form "SELECT ctid, * FROM
remote table WHERE ctid = $1", so that the modified query would be of
the form "SELECT ctid, * FROM remote table WHERE ctid = $1 AND *remote
quals*".

Sorry, I forgot to add "FOR UPDATE" to the before/after queries.

My question is how to pull expression of the remote query.
It shall be stored at somewhere private field of ForeignScanState,
however, RefetchForeignRow does not have direct access to the
relevant ForeignScanState node.
It is what I asked at the question (1).

I imagined the following steps to get the remote query string: (1)
create the remote query string and store it in fdw_private during
postgresGetForeignPlan, (2) extract the string from fdw_private and
store it in erm->ermExtra during postgresBeginForeignScan, and (3)
extract the string from erm->ermExtra in postgresRefetchForeignRow.

Also note that EvalPlanQualFetchRowMarks() will raise an error
if RefetchForeignRow callback returned NULL tuple.
Is it right or expected behavior?

IIUC, I think that that behavior is reasonable.

It looks to me this callback is designed to pull out a particular
tuple identified by the remote-row-id, regardless of the qualifier
checks based on the latest value.

Because erm->markType==ROW_MARK_REFERENCE, I don't think that that
behavior would cause any problem. Maybe I'm missing something, though.

Really?

ExecLockRows() calls EvalPlanQualFetchRowMarks() to fill the EPQ tuple
slots prior to EvalPlanQualNext(), because these tuples are referenced
during EPQ rechecks.
The purpose of EvalPlanQualNext() is to evaluate whether the current set
of rows satisfies the qualifiers of the underlying scan/join.
Then, if it does not, it *ignores* the current tuples, as follows.

    /*
     * Now fetch any non-locked source rows --- the EPQ logic knows how to
     * do that.
     */
    EvalPlanQualSetSlot(&node->lr_epqstate, slot);
    EvalPlanQualFetchRowMarks(&node->lr_epqstate);  <--- LOAD REMOTE ROWS

    /*
     * And finally we can re-evaluate the tuple.
     */
    slot = EvalPlanQualNext(&node->lr_epqstate);    <--- EVALUATE QUALIFIERS
    if (TupIsNull(slot))
    {
        /* Updated tuple fails qual, so ignore it and go on */
        goto lnext;     <-- IGNORE THE ROW, NOT RAISE AN ERROR
    }

What happens if RefetchForeignRow raises an error when the latest
row exists but violates the "remote quals"?
That case should be ignored, unlike the case where the remote row
identified by the row-id does not exist.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#81

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Kouhei Kaigai (#78)

Re: Foreign join pushdown vs EvalPlanQual

On Tue, Sep 29, 2015 at 4:49 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Also note that EvalPlanQualFetchRowMarks() will raise an error
if RefetchForeignRow callback returned NULL tuple.
Is it right or expected behavior?

That's not how I read the code. If RefetchForeignRow returns NULL, we
just ignore the row and continue on to the next one:

    if (copyTuple == NULL)
    {
        /* couldn't get the lock, so skip this row */
        goto lnext;
    }

And that seems exactly right: RefetchForeignRow needs to test that the
tuple is still present on the remote side, and that any remote quals
are matched. If either of those is false, it can return NULL.
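
To make the control flow concrete, here is a small self-contained toy
model of the behavior described above: a refetch routine returns NULL
when the row is gone or no longer matches the remote quals, and the
caller skips that row instead of raising an error. ToyRow,
toy_refetch_row and toy_lock_rows are invented for this sketch and are
not PostgreSQL structures or functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a remote row; not a PostgreSQL structure. */
typedef struct ToyRow
{
    int  a;
    bool exists;    /* false once the row has been deleted remotely */
} ToyRow;

/*
 * Hypothetical refetch routine: returns the row if it still exists and
 * satisfies the remote qual (here "a > min_a"), otherwise NULL.
 */
static const ToyRow *
toy_refetch_row(const ToyRow *row, int min_a)
{
    if (!row->exists)
        return NULL;            /* row vanished on the remote side */
    if (!(row->a > min_a))
        return NULL;            /* row no longer matches the remote quals */
    return row;
}

/* Caller: count surviving rows, skipping (not erroring on) NULL results. */
static int
toy_lock_rows(const ToyRow *rows, int nrows, int min_a)
{
    int kept = 0;

    for (int i = 0; i < nrows; i++)
    {
        const ToyRow *copy = toy_refetch_row(&rows[i], min_a);

        if (copy == NULL)
            continue;           /* plays the role of "goto lnext" */
        kept++;
    }
    return kept;
}
```

With rows {5, 0, deleted 7, 9} and the qual "a > 1", the loop keeps two
rows and silently skips the other two.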

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#82

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#75)

Re: Foreign join pushdown vs EvalPlanQual

On Mon, Sep 28, 2015 at 11:15 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I thought the same thing [1]. While I thought it was relatively easy to
make changes to RefetchForeignRow that way for the foreign table case
(scanrelid>0), I was not sure how hard it would be to do so for the foreign
join case (scanrelid==0). So, I proposed to leave that changes for 9.6.
I'll have a rethink on this issue along the lines of that approach.

Well, I spent some more time looking at this today, and testing it out
using a fixed-up version of your foreign_join_v16 patch, and I decided
that RefetchForeignRow is basically a red herring. That's only used
for FDWs that do late row locking, but postgres_fdw (and probably many
others) do early row locking, in which case RefetchForeignRow never
gets called. Instead, the row is treated as a "non-locked source row"
by ExecLockRows (even though it is in fact locked) and is re-fetched
by EvalPlanQualFetchRowMarks. We should probably update the comment
about non-locked source rows to mention the case of FDWs that do early
row locking.

Anyway, everything appears to work OK up to this point: we correctly
retrieve the saved whole-rows from the foreign side and call
EvalPlanQualSetTuple on each one, setting es_epqTuple[rti - 1] and
es_epqTupleSet[rti - 1]. So far, so good. Now we call
EvalPlanQualNext, and that's where we get into trouble. We've got the
already-locked tuples from the foreign side and those tuples CANNOT
have gone away or been modified because we have already locked them.
So, all the foreign join needs to do is return the same tuple that it
returned before: the EPQ recheck was triggered by some *other* table
involved in the plan, not our table. A local table also involved in
the query, or conceivably a foreign table that does late row locking,
could have had something change under it after the row was fetched,
but in postgres_fdw that can't happen because we locked the row up
front. And thus, again, all we need to do is re-return the same
tuple. But we don't have that. Instead, the ROW_MARK_COPY logic has
caused us to preserve a copy of each *baserel* tuple.
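
The mismatch can be pictured with a toy model, given here only as a
sketch: saved test tuples are indexed by rti - 1, one per base relation,
so a scan node with scanrelid == 0 (the pushed-down join) has no slot of
its own to read. ToyEPQState and the toy_ functions are made up for
illustration and simplify the real es_epqTuple / es_epqTupleSet arrays.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NRELS 3

/* Toy EPQ state: one saved tuple per base relation, indexed by rti - 1. */
typedef struct ToyEPQState
{
    const char *epqTuple[TOY_NRELS];
    bool        epqTupleSet[TOY_NRELS];
} ToyEPQState;

/* Save a test tuple for range-table index rti (1-based). */
static void
toy_epq_set_tuple(ToyEPQState *st, unsigned rti, const char *tuple)
{
    st->epqTuple[rti - 1] = tuple;
    st->epqTupleSet[rti - 1] = true;
}

/*
 * Look up the test tuple for a scan node.  A join pushed down to the
 * remote side has scanrelid == 0, so there is no slot to index: only the
 * component base relations have saved tuples.  Return NULL in that case
 * instead of reading epqTupleSet[scanrelid - 1] out of bounds.
 */
static const char *
toy_epq_fetch(const ToyEPQState *st, unsigned scanrelid)
{
    if (scanrelid == 0)
        return NULL;
    if (!st->epqTupleSet[scanrelid - 1])
        return NULL;
    return st->epqTuple[scanrelid - 1];
}
```

Here the base relations at rti 2 and 3 can each get their saved row back,
but a lookup with scanrelid 0 finds nothing, which is the gap the joined
tuple would have to fill.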

Now, this is as sad as can be. Early row locking has huge advantages
for FDWs, both in terms of minimizing server round trips and also
because the FDW doesn't really need to do anything about EPQ. Sure,
it's inefficient to carry around whole-row references, but it makes
life easy for the FDW author.

So, if we wanted to fix this in a way that preserves the spirit of
what's there now, it seems to me that we'd want the FDW to return
something that's like a whole row reference, but represents the output
of the foreign join rather than some underlying base table. And then
get the EPQ machinery to have the evaluation of the ForeignScan for
the join, when it happens in an EPQ context, to return that tuple.
But I don't really have a good idea how to do that.

More thought seems needed here...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#83

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#80)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/29 21:38, Kouhei Kaigai wrote:

Also note that EvalPlanQualFetchRowMarks() will raise an error
if RefetchForeignRow callback returned NULL tuple.
Is it right or expected behavior?

IIUC, I think that that behavior is reasonable.

It looks to me this callback is designed to pull out a particular
tuple identified by the remote-row-id, regardless of the qualifier
checks based on the latest value.

Because erm->markType==ROW_MARK_REFERENCE, I don't think that that
behavior would cause any problem. Maybe I'm missing something, though.

Really?

Yeah, I think RefetchForeignRow should work differently depending on the
rowmark type. When erm->markType==ROW_MARK_REFERENCE, the callback
should fetch a particular tuple identified by the rowid (ie, the same
version previously obtained) successfully. So for that case, I don't
think the remote quals need to be checked during RefetchForeignRow.

ExecLockRows() calls EvalPlanQualFetchRowMarks() to fill up EPQ tuple
slot prior to EvalPlanQualNext(), because these tuples are referenced
during EPQ rechecks.
The purpose of EvalPlanQualNext() is evaluate whether the current bunch
of rows are visible towards the qualifiers of underlying scan/join.
Then, if not visible, it *ignores* the current tuples, as follows.

    /*
     * Now fetch any non-locked source rows --- the EPQ logic knows how to
     * do that.
     */
    EvalPlanQualSetSlot(&node->lr_epqstate, slot);
    EvalPlanQualFetchRowMarks(&node->lr_epqstate);  <--- LOAD REMOTE ROWS

    /*
     * And finally we can re-evaluate the tuple.
     */
    slot = EvalPlanQualNext(&node->lr_epqstate);    <--- EVALUATE QUALIFIERS
    if (TupIsNull(slot))
    {
        /* Updated tuple fails qual, so ignore it and go on */
        goto lnext;     <-- IGNORE THE ROW, NOT RAISE AN ERROR
    }

What happen if RefetchForeignRow raise an error in case when the latest
row exists but violated towards the "remote quals" ?
This is the case to be ignored, unlike the case when remote row identified
by row-id didn't exist.

IIUC, I think that depends on where RefetchForeignRow is called from
(ie, the rowmark type). When it is called from
EvalPlanQualFetchRowMarks, the transaction should be aborted, as I
mentioned above, if it can't fetch the same version previously
obtained. But when RefetchForeignRow is called from ExecLockRows, the
tuple should just be ignored, as in the code above, if the latest
version on the remote side doesn't satisfy the remote quals.
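
The distinction drawn in this message can be written down as a small
decision function. This is only a sketch of the position argued here
(which differs from the reading given in #81), with invented names:
ToyRowMarkType, ToyOutcome and toy_refetch_outcome are not PostgreSQL
identifiers.

```c
#include <stdbool.h>

/* Toy versions of the two call contexts discussed in this message. */
typedef enum ToyRowMarkType
{
    TOY_ROW_MARK_EXCLUSIVE,     /* refetch called from ExecLockRows */
    TOY_ROW_MARK_REFERENCE      /* refetch called from EvalPlanQualFetchRowMarks */
} ToyRowMarkType;

typedef enum ToyOutcome
{
    TOY_USE_ROW,                /* row was fetched; carry on with it */
    TOY_SKIP_ROW,               /* quietly ignore the row */
    TOY_RAISE_ERROR             /* abort the transaction */
} ToyOutcome;

/*
 * Decide what to do after a refetch attempt.  "found" is false when the
 * refetch produced no tuple: the latest version fails the remote quals
 * (locking case) or the previously seen version is gone (reference case).
 */
static ToyOutcome
toy_refetch_outcome(ToyRowMarkType markType, bool found)
{
    if (found)
        return TOY_USE_ROW;
    if (markType == TOY_ROW_MARK_REFERENCE)
        return TOY_RAISE_ERROR;
    return TOY_SKIP_ROW;
}
```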

Best regards,
Etsuro Fujita

#84

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#82)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/30 6:55, Robert Haas wrote:

On Mon, Sep 28, 2015 at 11:15 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I thought the same thing [1]. While I thought it was relatively easy to
make changes to RefetchForeignRow that way for the foreign table case
(scanrelid>0), I was not sure how hard it would be to do so for the foreign
join case (scanrelid==0). So, I proposed to leave that changes for 9.6.
I'll have a rethink on this issue along the lines of that approach.

Well, I spent some more time looking at this today, and testing it out
using a fixed-up version of your foreign_join_v16 patch, and I decided
that RefetchForeignRow is basically a red herring. That's only used
for FDWs that do late row locking, but postgres_fdw (and probably many
others) do early row locking, in which case RefetchForeignRow never
gets called. Instead, the row is treated as a "non-locked source row"
by ExecLockRows (even though it is in fact locked) and is re-fetched
by EvalPlanQualFetchRowMarks. We should probably update the comment
about non-locked source rows to mention the case of FDWs that do early
row locking.

Anyway, everything appears to work OK up to this point: we correctly
retrieve the saved whole-rows from the foreign side and call
EvalPlanQualSetTuple on each one, setting es_epqTuple[rti - 1] and
es_epqTupleSet[rti - 1]. So far, so good. Now we call
EvalPlanQualNext, and that's where we get into trouble. We've got the
already-locked tuples from the foreign side and those tuples CANNOT
have gone away or been modified because we have already locked them.
So, all the foreign join needs to do is return the same tuple that it
returned before: the EPQ recheck was triggered by some *other* table
involved in the plan, not our table. A local table also involved in
the query, or conceivably a foreign table that does late row locking,
could have had something change under it after the row was fetched,
but in postgres_fdw that can't happen because we locked the row up
front. And thus, again, all we need to do is re-return the same
tuple. But we don't have that. Instead, the ROW_MARK_COPY logic has
caused us to preserve a copy of each *baserel* tuple.

Now, this is as sad as can be. Early row locking has huge advantages
for FDWs, both in terms of minimizing server round trips and also
because the FDW doesn't really need to do anything about EPQ. Sure,
it's inefficient to carry around whole-row references, but it makes
life easy for the FDW author.

So, if we wanted to fix this in a way that preserves the spirit of
what's there now, it seems to me that we'd want the FDW to return
something that's like a whole row reference, but represents the output
of the foreign join rather than some underlying base table. And then
get the EPQ machinery to have the evaluation of the ForeignScan for
the join, when it happens in an EPQ context, to return that tuple.
But I don't really have a good idea how to do that.

I like a general solution. Can't we extend that idea so that foreign
tables involved in a foreign join are allowed to have different rowmark
methods other than ROW_MARK_COPY, eg, ROW_MARK_EXCLUSIVE?

Best regards,
Etsuro Fujita

#85

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Robert Haas (#82)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Sent: Wednesday, September 30, 2015 6:55 AM
To: Etsuro Fujita
Cc: Kaigai Kouhei(海外浩平); PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On Mon, Sep 28, 2015 at 11:15 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I thought the same thing [1]. While I thought it was relatively easy to
make changes to RefetchForeignRow that way for the foreign table case
(scanrelid>0), I was not sure how hard it would be to do so for the foreign
join case (scanrelid==0). So, I proposed to leave that changes for 9.6.
I'll have a rethink on this issue along the lines of that approach.

Well, I spent some more time looking at this today, and testing it out
using a fixed-up version of your foreign_join_v16 patch, and I decided
that RefetchForeignRow is basically a red herring. That's only used
for FDWs that do late row locking, but postgres_fdw (and probably many
others) do early row locking, in which case RefetchForeignRow never
gets called. Instead, the row is treated as a "non-locked source row"
by ExecLockRows (even though it is in fact locked) and is re-fetched
by EvalPlanQualFetchRowMarks. We should probably update the comment
about non-locked source rows to mention the case of FDWs that do early
row locking.

Indeed, select_rowmark_type() says ROW_MARK_COPY if GetForeignRowMarkType
callback is not defined.
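
A minimal sketch of that fallback, using toy types rather than the real
planner code: if the FDW supplies no row-mark callback, the whole-row
copy method is chosen. ToyFdwRoutine, toy_select_rowmark_type and the
TOY_ constants are hypothetical stand-ins for FdwRoutine,
select_rowmark_type() and the RowMarkType values.

```c
#include <stddef.h>

/* Toy mirror of two of the row-mark kinds discussed in this thread. */
typedef enum ToyRowMarkType
{
    TOY_ROW_MARK_EXCLUSIVE,     /* a locking mark chosen by the FDW itself */
    TOY_ROW_MARK_COPY           /* executor keeps a whole-row copy instead */
} ToyRowMarkType;

/* Hypothetical FDW callback table with an optional row-mark callback. */
typedef struct ToyFdwRoutine
{
    ToyRowMarkType (*get_row_mark_type)(void);
} ToyFdwRoutine;

/* Example callback for an FDW that picks its own row-mark type. */
static ToyRowMarkType
toy_late_locking(void)
{
    return TOY_ROW_MARK_EXCLUSIVE;
}

/* Fall back to the whole-row copy when the FDW defines no callback. */
static ToyRowMarkType
toy_select_rowmark_type(const ToyFdwRoutine *fdw)
{
    if (fdw->get_row_mark_type != NULL)
        return fdw->get_row_mark_type();
    return TOY_ROW_MARK_COPY;
}
```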

Anyway, everything appears to work OK up to this point: we correctly
retrieve the saved whole-rows from the foreign side and call
EvalPlanQualSetTuple on each one, setting es_epqTuple[rti - 1] and
es_epqTupleSet[rti - 1]. So far, so good. Now we call
EvalPlanQualNext, and that's where we get into trouble. We've got the
already-locked tuples from the foreign side and those tuples CANNOT
have gone away or been modified because we have already locked them.
So, all the foreign join needs to do is return the same tuple that it
returned before: the EPQ recheck was triggered by some *other* table
involved in the plan, not our table. A local table also involved in
the query, or conceivably a foreign table that does late row locking,
could have had something change under it after the row was fetched,
but in postgres_fdw that can't happen because we locked the row up
front. And thus, again, all we need to do is re-return the same
tuple. But we don't have that. Instead, the ROW_MARK_COPY logic has
caused us to preserve a copy of each *baserel* tuple.

Now, this is as sad as can be. Early row locking has huge advantages
for FDWs, both in terms of minimizing server round trips and also
because the FDW doesn't really need to do anything about EPQ. Sure,
it's inefficient to carry around whole-row references, but it makes
life easy for the FDW author.

I got the point. Would it be helpful to add a description of why
ROW_MARK_COPY does not need a recheck for both local and remote tuples?
http://www.postgresql.org/docs/devel/static/fdw-row-locking.html

So, if we wanted to fix this in a way that preserves the spirit of
what's there now, it seems to me that we'd want the FDW to return
something that's like a whole row reference, but represents the output
of the foreign join rather than some underlying base table. And then
get the EPQ machinery to have the evaluation of the ForeignScan for
the join, when it happens in an EPQ context, to return that tuple.
But I don't really have a good idea how to do that.

More thought seems needed here...

An alternative built-in join execution?
Once it is executed in the EPQ context, the built-in join node fetches
a tuple from each of its inner and outer sides. Each is ultimately
fetched from the EPQ slot, and then the alternative join produces a
result tuple.
For the case where the FDW is not designed to handle the join by itself,
it is a reasonable fallback, I think.

I expect the FDW driver needs to handle the EPQ recheck in the cases below:
* ForeignScan on a base relation that uses late row locking.
* ForeignScan on a join relation, even with early locking.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#86

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#85)

Re: Foreign join pushdown vs EvalPlanQual

Hello, I think I have caught up with this thread.

So, if we wanted to fix this in a way that preserves the spirit of
what's there now, it seems to me that we'd want the FDW to return
something that's like a whole row reference, but represents the output
of the foreign join rather than some underlying base table. And then
get the EPQ machinery to have the evaluation of the ForeignScan for
the join, when it happens in an EPQ context, to return that tuple.
But I don't really have a good idea how to do that.

More thought seems needed here...

Alternative built-in join execution?
Once it is executed under the EPQ context, built-in join node fetches
a tuple from both of inner and outer side for each. It is eventually
fetched from the EPQ slot, then the alternative join produce a result
tuple.

It seems quite similar to what Fujita-san is trying now by
somehow *replacing* the "foreign join" scan node with an alternative
local join plan during EPQ. I think what Robert describes is a
"foreign join" scan that behaves completely as an ordinary scan
node in the executor. The current framework of foreign join pushdown
seems a bit tricky because it incompletely emulates a local join
on foreign scans. That mixture seems to be the root cause of this
problem.

1. Somehow run local joins on the current EPQ tuples given
by "foreign join" scans.

1.1 Somehow detect a running EPQ and switch the plan to run in
ExecScanFetch or somewhere else.

1.2 Replace the "foreign join scan" node with the alternative local
join node in ExecInit. (I don't like this.)

1.3 An in-core alternative local join executor for join pushdown?

2. Make the "foreign join" scan plan node completely compliant with
the current executor semantics of an ordinary scan node.

In other words, the node has a corresponding RTE_RELATION RTE,
is marked with ROW_MARK_COPY for locking, and returns a slot with a
tlist that contains the join result columns and the whole-row var
for them. Then EvalPlanQualFetchRowMarks gets the whole-row var
and sets it into es_epqTuple for the corresponding *relid*.

I prefer option 2, but I have no good idea how to do that now, either.

In case when FDW is not designed to handle join by itself, it is
a reasonable fallback I think.

I expect FDW driver needs to handle EPQ recheck in the case below:
* ForeignScan on base relation and it uses late row locking.

I think this is indisputable.

* ForeignScan on join relation, even if early locking.

This could be unnecessary if the "foreign join" scan node can
have its own rowmark of ROW_MARK_COPY.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#87

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#85)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/01 11:15, Kouhei Kaigai wrote:

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
On Mon, Sep 28, 2015 at 11:15 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I thought the same thing [1]. While I thought it was relatively easy to
make changes to RefetchForeignRow that way for the foreign table case
(scanrelid>0), I was not sure how hard it would be to do so for the foreign
join case (scanrelid==0). So, I proposed to leave that changes for 9.6.
I'll have a rethink on this issue along the lines of that approach.

So, if we wanted to fix this in a way that preserves the spirit of
what's there now, it seems to me that we'd want the FDW to return
something that's like a whole row reference, but represents the output
of the foreign join rather than some underlying base table. And then
get the EPQ machinery to have the evaluation of the ForeignScan for
the join, when it happens in an EPQ context, to return that tuple.
But I don't really have a good idea how to do that.

Alternative built-in join execution?
Once it is executed under the EPQ context, built-in join node fetches
a tuple from both of inner and outer side for each. It is eventually
fetched from the EPQ slot, then the alternative join produce a result
tuple.
In case when FDW is not designed to handle join by itself, it is
a reasonable fallback I think.

I expect FDW driver needs to handle EPQ recheck in the case below:
* ForeignScan on base relation and it uses late row locking.
* ForeignScan on join relation, even if early locking.

I also think the approach would be one choice. But one thing I'm
concerned about is plan creation for that by the FDW author; that would
make life hard for the FDW author. (That was proposed by me ...)

So, I'd like to investigate another approach that preserves the
applicability of late row locking to the join pushdown case as well as
the spirit of what's there now. The basic idea is (1) add a new
callback routine RefetchForeignJoinRow that refetches one foreign-join
tuple from the foreign server, after locking remote tuples for the
component foreign tables if required, and (2) call that routine in
ExecScanFetch if the target scan is for a foreign join and the component
foreign tables need to be locked late, else just return the
foreign-join tuple stored in the parent's state tree, which is the tuple
mentioned by Robert, for preserving the spirit of what's there now. I
think that ExecLockRows and EvalPlanQualFetchRowMarks should probably be
modified so as to skip foreign tables involved in a foreign join.
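
To make the proposed control flow easier to follow, here is a minimal, self-contained sketch in C. It is only an illustration under stated assumptions: the struct types, field names and the sketch_* functions are hypothetical stand-ins, not the real executor structures, and only the names ExecScanFetch and RefetchForeignJoinRow come from the proposal itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a remote join tuple; not a PostgreSQL type. */
typedef struct SketchJoinTuple
{
    int a;
    int b;
} SketchJoinTuple;

/* Hypothetical, heavily simplified state of a ForeignScan node. */
typedef struct SketchForeignScan
{
    unsigned scanrelid;            /* 0 means the node scans a foreign join */
    bool needs_late_locking;       /* some component table is locked late */
    SketchJoinTuple *saved_tuple;  /* join tuple kept from the original scan */
    SketchJoinTuple *remote_tuple; /* simulated current row on the remote side */
    int refetch_calls;             /* counts simulated remote round trips */
} SketchForeignScan;

/* Stand-in for the proposed RefetchForeignJoinRow callback (assumption). */
static SketchJoinTuple *
sketch_refetch_foreign_join_row(SketchForeignScan *node)
{
    node->refetch_calls++;
    return node->remote_tuple;
}

/* The branch that item (2) would add to ExecScanFetch under EPQ. */
static SketchJoinTuple *
sketch_epq_fetch_join_tuple(SketchForeignScan *node)
{
    assert(node->scanrelid == 0);  /* only foreign joins take this path */

    if (node->needs_late_locking)
        return sketch_refetch_foreign_join_row(node);

    /* No late locking: reuse the join tuple that was already fetched. */
    return node->saved_tuple;
}
```

In this sketch the early-locking case never touches the remote side, which is the property the proposal wants to preserve.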

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#88

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#86)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/01 15:38, Kyotaro HORIGUCHI wrote:

I expect the FDW driver needs to handle the EPQ recheck in the cases below:
* ForeignScan on a base relation that uses late row locking.

I think this is indisputable.

I think so. But I think this case would probably be handled by the
existing RefetchForeignRow routine as I said upthread.

* ForeignScan on a join relation, even with early locking.

This could be unnecessary if the "foreign join" scan node can
have its own rowmark of ROW_MARK_COPY.

That's an idea, but I'd vote for preserving the applicability of late
row locking to the foreign join case, allowing component foreign tables
involved in a foreign join to have different rowmark methods other than
ROW_MARK_COPY.

Best regards,
Etsuro Fujita


#89

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#87)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

At Thu, 1 Oct 2015 17:50:25 +0900, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote in <560CF3D1.9060305@lab.ntt.co.jp>

On 2015/10/01 11:15, Kouhei Kaigai wrote:

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
On Mon, Sep 28, 2015 at 11:15 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
So, if we wanted to fix this in a way that preserves the spirit of
what's there now, it seems to me that we'd want the FDW to return
something that's like a whole row reference, but represents the output
of the foreign join rather than some underlying base table. And then
get the EPQ machinery to have the evaluation of the ForeignScan for
the join, when it happens in an EPQ context, to return that tuple.
But I don't really have a good idea how to do that.

Alternative built-in join execution?
Once it is executed under the EPQ context, the built-in join node fetches
one tuple each from its inner and outer sides. Each tuple ultimately comes
from the EPQ slot, and the alternative join then produces a result tuple.
For an FDW that is not designed to handle the join by itself, I think it
is a reasonable fallback.

I expect the FDW driver needs to handle the EPQ recheck in the cases below:
* ForeignScan on a base relation that uses late row locking.
* ForeignScan on a join relation, even with early locking.

I also think the approach would be one choice. But one thing I'm
concerned about is plan creation for that by the FDW author; that
would make life hard for the FDW author. (That was proposed by me
...)

So, I'd like to investigate another approach that preserves the
applicability of late row locking to the join pushdown case as well as
the spirit of what's there now. The basic idea is (1) add a new
callback routine RefetchForeignJoinRow that refetches one foreign-join
tuple from the foreign server, after locking remote tuples for the
component foreign tables if required,

That would be the case where at least one of the component relations
of a foreign join is other than ROW_MARK_COPY, which is not
possible so far in postgres_fdw. For the case where some of the
component relations are other than ROW_MARK_COPY, we should probably
call RefetchForeignRow for those relations and construct the joined
row together with the ROW_MARK_COPY relations.

We could indeed consider some logic for that case, but I think it
is obvious that the case we should focus on now is a "foreign join"
scan in which all underlying foreign scans are ROW_MARK_COPY.
A "foreign join" scan with ROW_MARK_COPY looks
promising (to me), and in the future it might be able to coexist with
the refetch mechanism you have in mind... Maybe :p
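
As a rough illustration of that condition, the check "all underlying foreign scans are ROW_MARK_COPY" could be sketched as below. The enum and function names are hypothetical and only loosely mirror the rowmark kinds discussed in this thread; this is not the planner's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the rowmark kinds discussed here. */
typedef enum SketchRowMarkType
{
    SKETCH_ROW_MARK_EXCLUSIVE,  /* stands in for any locking rowmark */
    SKETCH_ROW_MARK_REFERENCE,  /* remember the row identity, refetch later */
    SKETCH_ROW_MARK_COPY        /* carry a whole-row copy with the result */
} SketchRowMarkType;

/*
 * Returns true when every component relation of the foreign join uses
 * the COPY method, i.e. the simple case proposed as the first target.
 */
static bool
sketch_join_can_use_copy(const SketchRowMarkType *marks, size_t nmarks)
{
    assert(marks != NULL || nmarks == 0);

    for (size_t i = 0; i < nmarks; i++)
    {
        if (marks[i] != SKETCH_ROW_MARK_COPY)
            return false;   /* this relation would need a refetch */
    }
    return true;
}
```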

and (2) call that routine in
ExecScanFetch if the target scan is for a foreign join and the
component foreign tables need to be locked late, else just return
the foreign-join tuple stored in the parent's state tree, which is the
tuple mentioned by Robert, for preserving the spirit of what's there
now. I think that ExecLockRows and EvalPlanQualFetchRowMarks should
probably be modified so as to skip foreign tables involved in a
foreign join.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#90

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#89)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/01 19:02, Kyotaro HORIGUCHI wrote:

At Thu, 1 Oct 2015 17:50:25 +0900, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote in <560CF3D1.9060305@lab.ntt.co.jp>

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas

So, if we wanted to fix this in a way that preserves the spirit of
what's there now, it seems to me that we'd want the FDW to return
something that's like a whole row reference, but represents the output
of the foreign join rather than some underlying base table. And then
get the EPQ machinery to have the evaluation of the ForeignScan for
the join, when it happens in an EPQ context, to return that tuple.
But I don't really have a good idea how to do that.

So, I'd like to investigate another approach that preserves the
applicability of late row locking to the join pushdown case as well as
the spirit of what's there now. The basic idea is (1) add a new
callback routine RefetchForeignJoinRow that refetches one foreign-join
tuple from the foreign server, after locking remote tuples for the
component foreign tables if required,

That would be the case where at least one of the component relations
of a foreign join is other than ROW_MARK_COPY, which is not
possible so far in postgres_fdw.

Yes. To be exact, it's possible for the component relations to have
rowmark methods other than ROW_MARK_COPY using GetForeignRowMarkType, in
principle, but the server crashes ...

For the case where some of the
component relations are other than ROW_MARK_COPY, we should probably
call RefetchForeignRow for those relations and construct the joined
row together with the ROW_MARK_COPY relations.

You are saying that we should construct the joined row using an
alternative local join execution plan?

We could indeed consider some logic for that case, but I think it
is obvious that the case we should focus on now is a "foreign join"
scan in which all underlying foreign scans are ROW_MARK_COPY.
A "foreign join" scan with ROW_MARK_COPY looks
promising (to me), and in the future it might be able to coexist with
the refetch mechanism you have in mind... Maybe :p

I agree that the approach "foreign-join scan with ROW_MARK_COPY" would
be promising.

Best regards,
Etsuro Fujita


#91

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Etsuro Fujita (#87)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita
Sent: Thursday, October 01, 2015 5:50 PM
To: Kaigai Kouhei(海外浩平); Robert Haas
Cc: PostgreSQL-development; 花田茂
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/10/01 11:15, Kouhei Kaigai wrote:

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
On Mon, Sep 28, 2015 at 11:15 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I thought the same thing [1]. While I thought it was relatively easy to
make changes to RefetchForeignRow that way for the foreign table case
(scanrelid>0), I was not sure how hard it would be to do so for the foreign
join case (scanrelid==0). So, I proposed to leave those changes for 9.6.
I'll have a rethink on this issue along the lines of that approach.

So, if we wanted to fix this in a way that preserves the spirit of
what's there now, it seems to me that we'd want the FDW to return
something that's like a whole row reference, but represents the output
of the foreign join rather than some underlying base table. And then
get the EPQ machinery to have the evaluation of the ForeignScan for
the join, when it happens in an EPQ context, to return that tuple.
But I don't really have a good idea how to do that.

Alternative built-in join execution?
Once it is executed under the EPQ context, the built-in join node fetches
one tuple each from its inner and outer sides. Each tuple ultimately comes
from the EPQ slot, and the alternative join then produces a result tuple.
For an FDW that is not designed to handle the join by itself, I think it
is a reasonable fallback.

I expect the FDW driver needs to handle the EPQ recheck in the cases below:
* ForeignScan on a base relation that uses late row locking.
* ForeignScan on a join relation, even with early locking.

I also think the approach would be one choice. But one thing I'm
concerned about is plan creation for that by the FDW author; that would
make life hard for the FDW author. (That was proposed by me ...)

I don't follow that standpoint, but it is not worth repeating the same discussion.

So, I'd like to investigate another approach that preserves the
applicability of late row locking to the join pushdown case as well as
the spirit of what's there now. The basic idea is (1) add a new
callback routine RefetchForeignJoinRow that refetches one foreign-join
tuple from the foreign server, after locking remote tuples for the
component foreign tables if required, and (2) call that routine in
ExecScanFetch if the target scan is for a foreign join and the component
foreign tables need to be locked late, else just return the
foreign-join tuple stored in the parent's state tree, which is the tuple
mentioned by Robert, for preserving the spirit of what's there now. I
think that ExecLockRows and EvalPlanQualFetchRowMarks should probably be
modified so as to skip foreign tables involved in a foreign join.

As long as the FDW author can choose the best way to produce a joined
tuple, it may be worth investigating.

My comments are:
* ForeignRecheck, not ExecScanFetch, is the best location to call
RefetchForeignJoinRow when scanrelid==0. Why add a special case for
FDW in the common routine?
* It is the FDW's choice where the remote join tuple is kept, even though
most FDWs will keep it in a private field of ForeignScanState.

Apart from the FDW requirement, custom-scan/join needs recheckMtd to be
called when scanrelid==0 to avoid an assertion failure. I hope FDW has a
symmetric structure; however, that is not a mandatory requirement for me.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#92

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#91)

Re: Foreign join pushdown vs EvalPlanQual

Hello, I have given this more consideration.

As long as the FDW author can choose the best way to produce a joined
tuple, it may be worth investigating.

My comments are:
* ForeignRecheck, not ExecScanFetch, is the best location to call
RefetchForeignJoinRow when scanrelid==0. Why add a special case for
FDW in the common routine?
* It is the FDW's choice where the remote join tuple is kept, even though
most FDWs will keep it in a private field of ForeignScanState.

I think that scanrelid == 0 means that the node in focus is not a
scan node in the current executor semantics.
EvalPlanQualFetchRowMarks fetches the possibly modified row, then
EvalPlanQualNext rechecks the new row. Those are the roles of the
respective functions.

By this criterion, the recheck routines are not the place for
refetching; EvalPlanQualFetchRowMarks is.

Again, the problem here is that the "foreign join" scan node is
actually a scan node, but it doesn't provide everything the
executor expects of a scan node. So there would be two ways to fix
this while preserving the semantics.

1. Make the "foreign join" scan node behave as a complete scan
node. That is, EvalPlanQualFetchRowMarks can somehow retrieve the
modified row version according to the type of row mark.

2. Make the "foreign join" node actually a join node that has
subnodes, so that the "foreign join" node can reconstruct the
result row from the results of the subnodes on EPQ.
(ExecForeignJoinNode would cease to call subnodes if it is
actually a scan node)

"3". Any other means that breaks the current semantics of joins and
scans in the executor, as you recommend. Some more adjustment
would be needed to go this way.

I don't know how the current design of FDW was arrived at,
especially the join pushdown feature, so I may be missing
something, but that is my view on this issue.

Apart from the FDW requirement, custom-scan/join needs recheckMtd to be
called when scanrelid==0 to avoid an assertion failure. I hope FDW has a
symmetric structure; however, that is not a mandatory requirement for me.

It wouldn't be needed if EvalPlanQualFetchRowMarks works as
expected. Is this wrong?

regards,


--
Kyotaro Horiguchi
NTT Open Source Software Center


#93

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#92)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kyotaro HORIGUCHI
Sent: Friday, October 02, 2015 9:50 AM
To: Kaigai Kouhei(海外浩平)
Cc: fujita.etsuro@lab.ntt.co.jp; robertmhaas@gmail.com;
pgsql-hackers@postgresql.org; shigeru.hanada@gmail.com
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

Hello, I have given this more consideration.

As long as the FDW author can choose the best way to produce a joined
tuple, it may be worth investigating.

My comments are:
* ForeignRecheck, not ExecScanFetch, is the best location to call
RefetchForeignJoinRow when scanrelid==0. Why add a special case for
FDW in the common routine?
* It is the FDW's choice where the remote join tuple is kept, even though
most FDWs will keep it in a private field of ForeignScanState.

I think that scanrelid == 0 means that the node in focus is not a
scan node in the current executor semantics.
EvalPlanQualFetchRowMarks fetches the possibly modified row, then
EvalPlanQualNext rechecks the new row. Those are the roles of the
respective functions.

By this criterion, the recheck routines are not the place for
refetching; EvalPlanQualFetchRowMarks is.

I never said the FDW should refetch tuples in the recheck routine.
All I am suggesting is that the projection to generate a joined tuple
and the recheck against the pushed-down qualifiers are the role of the
FDW driver, because it knows the best strategy to do the job.

Again, the problem here is that the "foreign join" scan node is
actually a scan node, but it doesn't provide everything the
executor expects of a scan node. So there would be two ways to fix
this while preserving the semantics.

1. Make the "foreign join" scan node behave as a complete scan
node. That is, EvalPlanQualFetchRowMarks can somehow retrieve the
modified row version according to the type of row mark.

2. Make the "foreign join" node actually a join node that has
subnodes, so that the "foreign join" node can reconstruct the
result row from the results of the subnodes on EPQ.
(ExecForeignJoinNode would cease to call subnodes if it is
actually a scan node)

"3". Any other means that breaks the current semantics of joins and
scans in the executor, as you recommend. Some more adjustment
would be needed to go this way.

I don't know how the current design of FDW was arrived at,
especially the join pushdown feature, so I may be missing
something, but that is my view on this issue.

It looks to me like all of them make the problem more complicated.
I have never heard why it is difficult for the "foreign-join" scan node
to construct a joined tuple from the EPQ slots that are already loaded.

Regardless of early or late locking, the EPQ slots of the base relations
are already filled, aren't they?

The whole mission of the "foreign-join" scan node is to return a joined
tuple as if it had been produced by the local join logic.
A local join consumes two tuples and generates one tuple.
The "foreign-join" scan node can do the equivalent, even under
the EPQ recheck context.

So, the job of the FDW driver is...
Step-1) Fetch the tuples from the EPQ slots of the base foreign relations
to be joined. Please note that this is just a pointer reference.
Step-2) Try to join these two (or more) tuples according to the
join condition (only the FDW knows it, because it is kept in private)
Step-3) If the result is valid, the FDW driver makes a projection from
these tuples, then returns it.
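
The three steps can be sketched in a few lines of self-contained C. This is only a toy model under stated assumptions: the SketchEPQ array stands in for the per-relation EPQ slots, the tuples have a single int column like foo(a) and bar(a) in the test case, and all sketch_* names are hypothetical, not existing core or FDW APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_RELS 8

/* Hypothetical base-relation row with one column, like foo(a) or bar(a). */
typedef struct SketchBaseTuple
{
    int a;
} SketchBaseTuple;

/* Hypothetical projected join result. */
typedef struct SketchJoinedTuple
{
    int outer_a;
    int inner_a;
} SketchJoinedTuple;

/* Stand-in for the EPQ test tuples, indexed by range table index - 1. */
typedef struct SketchEPQ
{
    const SketchBaseTuple *slot[SKETCH_MAX_RELS];
} SketchEPQ;

/* Join condition known only to the FDW (here: outer.a = inner.a). */
static bool
sketch_join_qual(const SketchBaseTuple *outer, const SketchBaseTuple *inner)
{
    return outer->a == inner->a;
}

/* Returns true and fills *result if the EPQ tuples still join. */
static bool
sketch_recheck_foreign_join(const SketchEPQ *epq, unsigned outer_rti,
                            unsigned inner_rti, SketchJoinedTuple *result)
{
    assert(outer_rti >= 1 && outer_rti <= SKETCH_MAX_RELS);
    assert(inner_rti >= 1 && inner_rti <= SKETCH_MAX_RELS);

    /* Step-1: pick up the test tuples; this is just a pointer reference. */
    const SketchBaseTuple *outer = epq->slot[outer_rti - 1];
    const SketchBaseTuple *inner = epq->slot[inner_rti - 1];

    if (outer == NULL || inner == NULL)
        return false;           /* nothing to join */

    /* Step-2: evaluate the join condition locally. */
    if (!sketch_join_qual(outer, inner))
        return false;

    /* Step-3: project the joined tuple. */
    result->outer_a = outer->a;
    result->inner_a = inner->a;
    return true;
}
```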

If you are concerned about re-inventing the code for each FDW, core
can provide a utility routine that covers 95% of the FDW structure.

I want to keep EvalPlanQualFetchRowMarks on a per-base-relation basis.
It is a bad choice to consider joins at that point.

Apart from the FDW requirement, custom-scan/join needs recheckMtd to be
called when scanrelid==0 to avoid an assertion failure. I hope FDW has a
symmetric structure; however, that is not a mandatory requirement for me.

It wouldn't be needed if EvalPlanQualFetchRowMarks works as
expected. Is this wrong?

Yes, it does not work.
The expected behavior of EvalPlanQualFetchRowMarks is to load the tuple
to be rechecked onto the EPQ slot, using heap_fetch or a copied image.
It works on a per-base-relation basis.
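
A toy model of that per-base-relation behavior, as a hedged sketch: each rowmark fills exactly one slot, either from a copied image or by refetching the row by its identifier. The types, the array-backed "table" and the sketch_* names are hypothetical simplifications, not the executor's real data structures.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_RELS 8

typedef struct SketchTuple
{
    int a;
} SketchTuple;

typedef enum SketchMarkType
{
    SKETCH_MARK_REFERENCE,  /* refetch the row by its identifier */
    SKETCH_MARK_COPY        /* use the whole-row image carried along */
} SketchMarkType;

/* Hypothetical rowmark for one base relation. */
typedef struct SketchRowMark
{
    unsigned rti;               /* 1-based range table index */
    SketchMarkType type;
    const SketchTuple *copied;  /* copied image (COPY only) */
    const SketchTuple *table;   /* simulated table storage (REFERENCE only) */
    size_t tid;                 /* simulated row identifier into table */
} SketchRowMark;

/* Load one test tuple per base relation into the EPQ slots. */
static void
sketch_fetch_row_marks(const SketchRowMark *marks, size_t nmarks,
                       const SketchTuple *epq_slots[SKETCH_MAX_RELS])
{
    for (size_t i = 0; i < nmarks; i++)
    {
        const SketchRowMark *m = &marks[i];

        assert(m->rti >= 1 && m->rti <= SKETCH_MAX_RELS);

        if (m->type == SKETCH_MARK_COPY)
            epq_slots[m->rti - 1] = m->copied;          /* copied image */
        else
            epq_slots[m->rti - 1] = &m->table[m->tid];  /* refetch by id */
    }
}
```

Note that nothing here knows about joins: a scan with scanrelid==0 has no slot of its own in this model, which is the gap being discussed.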

Who can provide a projection to generate the joined tuple?
That is the job of the individual plan-state nodes walked during
EvalPlanQualNext().

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#94

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#92)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/02 9:50, Kyotaro HORIGUCHI wrote:

As long as the FDW author can choose the best way to produce a joined
tuple, it may be worth investigating.

My comments are:
* ForeignRecheck, not ExecScanFetch, is the best location to call
RefetchForeignJoinRow when scanrelid==0. Why add a special case for
FDW in the common routine?

In my understanding, the job that ExecScanRecheckMtd should do is to
check if the test tuple *already stored* in the plan node's scan slot
meets the access-method conditions, in general. So, ISTM that it'd be
somewhat odd to place RefetchForeignJoinRow within ForeignRecheck, to
store the remote join tuple in the slot. Also, RefetchForeignRow is
called from the common routines ExecLockRows/EvalPlanQualFetchRowMarks ...

* It is the FDW's choice where the remote join tuple is kept, even though
most FDWs will keep it in a private field of ForeignScanState.

I see.

To make it possible that the FDW doesn't have to do anything for cases
where the FDW doesn't do any late row locking, however, I think it'd be
more promising to use the remote join tuple stored in the scan slot of
the corresponding ForeignScanState node in the parent's planstate tree.
I haven't had a good idea for that yet, though.

EvalPlanQualFetchRowMarks fetches the possibly
modified row, then EvalPlanQualNext rechecks the new
row.

Really? EvalPlanQualFetchRowMarks fetches the tuples for any non-locked
relations, so I think that that function should fetch the same version
previously obtained for each such relation successfully.

Best regards,
Etsuro Fujita


#95

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#93)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

At Fri, 2 Oct 2015 03:10:01 +0000, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote in <9A28C8860F777E439AA12E8AEA7694F80114DAEC@BPXM15GP.gisp.nec.co.jp>

As long as the FDW author can choose the best way to produce a joined
tuple, it may be worth investigating.

My comments are:
* ForeignRecheck, not ExecScanFetch, is the best location to call
RefetchForeignJoinRow when scanrelid==0. Why add a special case for
FDW in the common routine?
* It is the FDW's choice where the remote join tuple is kept, even though
most FDWs will keep it in a private field of ForeignScanState.

I think that scanrelid == 0 means that the node in focus is not a
scan node in the current executor semantics.
EvalPlanQualFetchRowMarks fetches the possibly modified row, then
EvalPlanQualNext rechecks the new row. Those are the roles of the
respective functions.

By this criterion, the recheck routines are not the place for
refetching; EvalPlanQualFetchRowMarks is.

I never said the FDW should refetch tuples in the recheck routine.
All I am suggesting is that the projection to generate a joined tuple
and the recheck against the pushed-down qualifiers are the role of the
FDW driver, because it knows the best strategy to do the job.

I have no objection to rechecking being the FDW's job.

I think you are assuming that all ROW_MARK_COPY base rows are
held in ss_ScanTupleSlot, so simply calling recheckMtd on the slot
gives the function enough data. (EPQState would also be needed
to retrieve them, though..) Right?

All the underlying foreign tables should be marked as
ROW_MARK_COPY to call recheckMtd safely. And somehow it is required
to know which column stores which base tuple.

It looks to me like all of them make the problem more complicated.
I have never heard why it is difficult for the "foreign-join" scan node
to construct a joined tuple from the EPQ slots that are already loaded.

Regardless of early or late locking, the EPQ slots of the base relations
are already filled, aren't they?

Does recheckMtd need to take EState as a parameter?

The whole mission of the "foreign-join" scan node is to return a joined
tuple as if it had been produced by the local join logic.
A local join consumes two tuples and generates one tuple.
The "foreign-join" scan node can do the equivalent, even under
the EPQ recheck context.

So, the job of the FDW driver is...
Step-1) Fetch the tuples from the EPQ slots of the base foreign relations
to be joined. Please note that this is just a pointer reference.
Step-2) Try to join these two (or more) tuples according to the
join condition (only the FDW knows it, because it is kept in private)
Step-3) If the result is valid, the FDW driver makes a projection from
these tuples, then returns it.

If you are concerned about re-inventing the code for each FDW, core
can provide a utility routine that covers 95% of the FDW structure.

I want to keep EvalPlanQualFetchRowMarks on a per-base-relation basis.
It is a bad choice to consider joins at that point.

Apart from the FDW requirement, custom-scan/join needs recheckMtd to be
called when scanrelid==0 to avoid an assertion failure. I hope FDW has a
symmetric structure; however, that is not a mandatory requirement for me.

It wouldn't be needed if EvalPlanQualFetchRowMarks works as
expected. Is this wrong?

Yes, it does not work.
The expected behavior of EvalPlanQualFetchRowMarks is to load the tuple
to be rechecked onto the EPQ slot, using heap_fetch or a copied image.
It works on a per-base-relation basis.

Hmm. What I meant by "works as expected" is that the function
stores the tuple for the "foreign join" scan node. If it doesn't,
you're right.

Who can provide a projection to generate the joined tuple?
That is the job of the individual plan-state nodes walked during
EvalPlanQualNext().

EvalPlanQualNext simply rechecks the tuples stored in epqTuples,
which are designed to be provided by EvalPlanQualFetchRowMarks.

I think that that premise shouldn't be broken for convenience...

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#96

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#94)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

At Fri, 2 Oct 2015 12:51:42 +0900, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote in <560DFF4E.2000001@lab.ntt.co.jp>

On 2015/10/02 9:50, Kyotaro HORIGUCHI wrote:

Most of the quoted text is Kaigai-san's :)

As long as the FDW author can choose the best way to produce a joined
tuple, it may be worth investigating.

My comments are:
* ForeignRecheck, not ExecScanFetch, is the best location to call
RefetchForeignJoinRow when scanrelid==0. Why add a special case for
FDW in the common routine?

In my understanding, the job that ExecScanRecheckMtd should do is to
check if the test tuple *already stored* in the plan node's scan slot
meets the access-method conditions, in general. So, ISTM that it'd be
somewhat odd to place RefetchForeignJoinRow within ForeignRecheck,
to store the remote join tuple in the slot. Also, RefetchForeignRow
is called from the common routines
ExecLockRows/EvalPlanQualFetchRowMarks ...

Agreed, except for the necessity of RefetchForeignJoinRow.

* It is the FDW's choice where the remote join tuple is kept, even though
most FDWs will keep it in a private field of ForeignScanState.

I see.

To make it possible that the FDW doesn't have to do anything for cases
where the FDW doesn't do any late row locking, however, I think it'd
be more promising to use the remote join tuple stored in the scan slot
of the corresponding ForeignScanState node in the parent's planstate
tree. I haven't had a good idea for that yet, though.

One rough idea is to add an entry to root->rowMarks for the "foreign
join" paths (and remove the rowMarks for the underlying scans later if
the foreign join wins). If that is somehow propagated to
epqstate->arowMarks, EvalPlanQualFetchRowMarks will store the needed
tuple into epqTuples. This is what Kaigai-san criticized as
'makes things too complex'. :)

But I'm reluctant to break the assumption of ExecScanFetch.

EvalPlanQualFetchRowMarks fetches the possibly
modified row, then EvalPlanQualNext rechecks the new
row.

Really? EvalPlanQualFetchRowMarks fetches the tuples for any
non-locked relations, so I think that that function should fetch the
same version previously obtained for each such relation successfully.

Sorry, please ignore "possibly modified".

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#97

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#95)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Sent: Friday, October 02, 2015 1:28 PM
To: Kaigai Kouhei(海外浩平)
Cc: fujita.etsuro@lab.ntt.co.jp; robertmhaas@gmail.com;
pgsql-hackers@postgresql.org; shigeru.hanada@gmail.com
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

Hello,

At Fri, 2 Oct 2015 03:10:01 +0000, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote
in <9A28C8860F777E439AA12E8AEA7694F80114DAEC@BPXM15GP.gisp.nec.co.jp>

As long as the FDW author can choose the best way to produce a joined
tuple, it may be worth investigating.

My comments are:
* ForeignRecheck is the best location to call RefetchForeignJoinRow
when scanrelid==0, not ExecScanFetch. Why do you try to add a special
case for FDWs in the common routine?
* It is the FDW's choice where the remote join tuple is kept, even though
most FDWs will keep it in a private field of ForeignScanState.

I think that scanrelid == 0 means that the node in focus is not a
scan node in the current executor
semantics. EvalPlanQualFetchRowMarks fetches the possibly
modified row, then EvalPlanQualNext rechecks the new
row. Those are the roles of each function.

By this criterion, recheck routines are not the place for
refetching. EvalPlanQualFetchRowMarks is.

I never said the FDW should refetch tuples in the recheck routine.
All I suggest is that the projection to generate a joined tuple and
the recheck according to the pushed-down qualifiers are the role of
the FDW driver, because it knows the best strategy to do the job.

I have no objection that rechecking is FDW's job.

I think you are thinking that all ROW_MARK_COPY base rows are
held in ss_ScanTupleSlot so simply calling recheckMtd on the slot
gives enough data to the function. (EPQState would also be needed
to retrieve, though..) Right?

Not ss_ScanTupleSlot. It is initialized according to fdw_scan_tlist
in case of scanrelid==0, regardless of base foreign relation's
definition.
My expectation is that the FDW callback constructs tts_values/tts_isnull
of ss_ScanTupleSlot from the preloaded tuples in the EPQ slots
and the underlying projection. Only the FDW driver knows the best way to
construct this result tuple.

You can pull out EState reference from PlanState portion of the
ForeignScanState, so nothing needs to be changed.

All the underlying foreign tables should be marked as
ROW_MARK_COPY to call recheckMtd safely. And somehow it is required
to know which column stores which base tuple.

Even with ROW_MARK_REFERENCE and late locking, the tuple to be rechecked
is already loaded into estate->es_epqTuple[], isn't it?
The recheck routine does not need to care about the row-mark policy.

It looks to me that all of this makes the problem more complicated.
I have never heard why it is difficult for the "foreign-join" scan node
to construct a joined tuple using the EPQ slots that are already loaded.

Regardless of early or late locking, the EPQ slots of the base relations
are already filled, aren't they?

recheckMtd needs to take EState as a parameter?

No.

The whole mission of the "foreign-join" scan node is to return a joined
tuple as if it had been produced by local join logic.
A local join consumes two tuples and generates one tuple.
The "foreign-join" scan node can do the equivalent, even if it
is running in an EPQ recheck context.

So, job of FDW driver is...
Step-1) Fetch tuples from the EPQ slots of the base foreign relation
to be joined. Please note that it is just a pointer reference.
Step-2) Try to join these two (or more) tuples according to the
join condition (only FDW knows because it is kept in private)
Step-3) If result is valid, FDW driver makes a projection from these
tuples, then return it.
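
A minimal sketch of the three steps above, in plain C with no PostgreSQL
headers. The types BaseTuple and JoinedTuple, the function epq_rejoin, and
the join condition outer.x = inner.x are all assumptions made up for
illustration; a real FDW would work on the tuples in the EPQ slots and on
the join quals it keeps in its private state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a base-relation tuple held in an EPQ slot. */
typedef struct BaseTuple
{
    int x;      /* join key */
    int a;      /* payload column */
} BaseTuple;

/* Hypothetical stand-in for the joined tuple the scan node returns. */
typedef struct JoinedTuple
{
    int outer_a;
    int inner_a;
} JoinedTuple;

/*
 * Re-join two base tuples taken from the EPQ slots.
 * Returns true and fills *result if the join condition holds;
 * returns false if a tuple is missing or the condition fails.
 */
bool
epq_rejoin(const BaseTuple *outer, const BaseTuple *inner,
           JoinedTuple *result)
{
    assert(result != NULL);

    /* Step 1: reference the EPQ tuples; a missing tuple means no match. */
    if (outer == NULL || inner == NULL)
        return false;

    /* Step 2: evaluate the join condition (assumed: outer.x = inner.x). */
    if (outer->x != inner->x)
        return false;

    /* Step 3: project the joined tuple from the two base tuples. */
    result->outer_a = outer->a;
    result->inner_a = inner->a;
    return true;
}
```

The inputs are only read through pointers, which mirrors the note in
Step-1 that fetching from the EPQ slots is just a pointer reference.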

If you are concerned about reinventing the code in each FDW, core
can provide a utility routine that covers 95% of FDW structures.

I want to keep EvalPlanQualFetchRowMarks on a per-base-relation basis.
It is a bad choice to consider joins at this point.

Apart from the FDW requirement, custom-scan/join needs recheckMtd to be
called when scanrelid==0 to avoid an assertion failure. I hope FDW has
a symmetric structure; however, that is not a mandatory requirement for me.

It wouldn't be needed if EvalPlanQualFetchRowMarks works as
expected. Is this wrong?

Yes, it does not work.
The expected behavior of EvalPlanQualFetchRowMarks is to load the tuple
to be rechecked into the EPQ slot, using heap_fetch or a copied image.
It works on a per-base-relation basis.

Hmm. What I said by "works as expected" is that the function
stores the tuple for the "foreign join" scan node. If it doesn't,
you're right.

Which slot of the EPQ slot will save the joined tuple?
scanrelid is zero, and we have no identifier of join planstate.

Who can provide a projection to generate joined tuple?
It is a job of individual plan-state-node to be walked on during
EvalPlanQualNext().

EvalPlanQualNext simply does recheck tuples stored in epqTuples,
which are designed to be provided by EvalPlanQualFetchRowMarks.

I think that that premise shouldn't be broken for convenience...

Do I see something different or understand incorrectly?
EvalPlanQualNext() walks down entire subtree of the Lock node.
(epqstate->planstate is entire subplan of Lock node.)

TupleTableSlot *
EvalPlanQualNext(EPQState *epqstate)
{
    MemoryContext oldcontext;
    TupleTableSlot *slot;

    oldcontext = MemoryContextSwitchTo(epqstate->estate->es_query_cxt);
    slot = ExecProcNode(epqstate->planstate);
    MemoryContextSwitchTo(oldcontext);

    return slot;
}

If the relation joins are kept in the sub-plan, ExecProcNode()
processes the projection by the join, doesn't it?

Why is projection by the join not a part of EvalPlanQualNext()?
It is the core of its job. Without the projection by the join, the upper
node cannot recheck the tuple coming from the child nodes.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#98

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Kouhei Kaigai (#97)

Re: Foreign join pushdown vs EvalPlanQual

Hello, thank you for the explanation. I understand the background.

In the current planner implementation, row marks are tightly bound to
the initial RTEs. This is quite natural given the purpose of row marks.

During join search, a joinrel should be compatible between local
joins and remote joins, and of course so should its target list.
So it is quite difficult to add a whole-row resjunk for joinrels
before the whole join tree is completed, even if we allow row marks
that are not bound to base RTEs.

The result of make_rel_from_joinlist contains only winner paths
so we might be able to transform target list for this joinrel so
that it has join wholerows (and doesn't have unnecessary RTE
wholerows), but I don't see any clean way to do that.

As a result, all that LockRows can collect for EPQ are tuples
for base relations. There is no room to pass a joined whole row so far.

At Fri, 2 Oct 2015 05:04:44 +0000, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote in <9A28C8860F777E439AA12E8AEA7694F80114DBFB@BPXM15GP.gisp.nec.co.jp>

I never said the FDW should refetch tuples in the recheck routine.
All I suggest is that the projection to generate a joined tuple and
the recheck according to the pushed-down qualifiers are the role of
the FDW driver, because it knows the best strategy to do the job.

I have no objection that rechecking is FDW's job.

I think you are thinking that all ROW_MARK_COPY base rows are
held in ss_ScanTupleSlot so simply calling recheckMtd on the slot
gives enough data to the function. (EPQState would also be needed
to retrieve, though..) Right?

Not ss_ScanTupleSlot. It is initialized according to fdw_scan_tlist
in case of scanrelid==0, regardless of base foreign relation's
definition.

Sorry, EvalPlanQualFetchRowMarks retrieves wholerows from
epqstate->origslot.

My expectation is that the FDW callback constructs tts_values/tts_isnull
of ss_ScanTupleSlot from the preloaded tuples in the EPQ slots
and the underlying projection. Only the FDW driver knows the best way to
construct this result tuple.

Currently only the FDW itself knows precisely how the joined relation
is made.

You can pull out EState reference from PlanState portion of the
ForeignScanState, so nothing needs to be changed.

Exactly.

Apart from the FDW requirement, custom-scan/join needs recheckMtd to be
called when scanrelid==0 to avoid an assertion failure. I hope FDW has
a symmetric structure; however, that is not a mandatory requirement for me.

...

Hmm. What I said by "works as expected" is that the function
stores the tuple for the "foreign join" scan node. If it doesn't,
you're right.

Which slot of the EPQ slot will save the joined tuple?

Yes, that is the second significant problem. Furthermore, as described
above, a way to inject a joined whole-row var into the target
list for the pushed-down join seems even more difficult to find.

scanrelid is zero, and we have no identifier of join planstate.

Who can provide a projection to generate joined tuple?
It is a job of individual plan-state-node to be walked on during
EvalPlanQualNext().

EvalPlanQualNext simply does recheck tuples stored in epqTuples,
which are designed to be provided by EvalPlanQualFetchRowMarks.

I think that that premise shouldn't be broken for convenience...

Do I see something different or understand incorrectly?
EvalPlanQualNext() walks down entire subtree of the Lock node.
(epqstate->planstate is entire subplan of Lock node.)

TupleTableSlot *
EvalPlanQualNext(EPQState *epqstate)
{
    MemoryContext oldcontext;
    TupleTableSlot *slot;

    oldcontext = MemoryContextSwitchTo(epqstate->estate->es_query_cxt);
    slot = ExecProcNode(epqstate->planstate);
    MemoryContextSwitchTo(oldcontext);

    return slot;
}

If the relation joins are kept in the sub-plan, ExecProcNode()
processes the projection by the join, doesn't it?

Yes, but an alternative plan needs to be prepared to do such a
projection.

Why is projection by the join not a part of EvalPlanQualNext()?
It is the core of its job. Without the projection by the join, the upper
node cannot recheck the tuple coming from the child nodes.

What I'm uneasy about is that the foreign join introduces a difference
in behavior between ordinary fetching and EPQ fetching. Having a
joined whole row is quite natural, but it seems hard to get.

Another reason is that ExecScanFetch with scanrelid == 0 under EPQ
is an FDW/CS-specific feature and looks like a kind of hack. (Even
if it would be one of many.)

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#99

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#77)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/29 16:36, Etsuro Fujita wrote:

For the foreign table case (scanrelid>0), I imagined an approach
different than yours. In that case, I thought the issue would be
probably addressed by just modifying the remote query performed in
RefetchForeignRow, which would be of the form "SELECT ctid, * FROM
remote table WHERE ctid = $1", so that the modified query would be of
the form "SELECT ctid, * FROM remote table WHERE ctid = $1 AND *remote
quals*".

Sorry, I was wrong. I noticed that the modified query (that will be
sent to the remote server during postgresRefetchForeignRow) should be of
the form "SELECT * FROM (SELECT ctid, * FROM remote table WHERE ctid =
$1) ss WHERE *remote quals*". (I think the query of the form "SELECT
ctid, * FROM remote table WHERE ctid = $1 AND *remote quals*" would be
okay if using a TID scan on the remote side, though.)
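
A minimal sketch of how the refetch query text described above could be
assembled, in plain C. The helper name build_refetch_query and its
parameters are hypothetical and are not part of postgres_fdw; the sketch
assumes relname and remote_quals arrive already deparsed and quoted, and it
does no escaping of its own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the refetch query into buf and return the length snprintf
 * reports (the number of characters the full query needs).
 */
int
build_refetch_query(char *buf, size_t buflen,
                    const char *relname, const char *remote_quals)
{
    assert(buf != NULL && relname != NULL);

    /* Without remote quals, the plain ctid lookup is enough. */
    if (remote_quals == NULL || strlen(remote_quals) == 0)
        return snprintf(buf, buflen,
                        "SELECT ctid, * FROM %s WHERE ctid = $1", relname);

    /* Pin the row by ctid in a subquery, then apply the remote quals. */
    return snprintf(buf, buflen,
                    "SELECT * FROM (SELECT ctid, * FROM %s WHERE ctid = $1) ss"
                    " WHERE %s",
                    relname, remote_quals);
}
```

Keeping the ctid lookup inside the subquery "ss" leaves the remote quals
as a separate filter on the single row that the lookup returns.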

Best regards,
Etsuro Fujita

#100

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#98)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Oct 2, 2015 at 4:26 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

During join search, a joinrel should be compatible between local
joins and remote joins, and of course so should its target list.
So it is quite difficult to add a whole-row resjunk for joinrels
before the whole join tree is completed, even if we allow row marks
that are not bound to base RTEs.

Suppose ROW_MARK_COPY is in use, and suppose the query is: SELECT
ft1.a, ft1.b, ft2.a, ft2.b FROM ft1, ft2 WHERE ft1.x = ft2.x;

When the foreign join is executed, there's going to be a slot that
needs to be populated with ft1.a, ft1.b, ft2.a, ft2.b, and a whole row
reference. Now, let's suppose the slot descriptor has 5 columns: those
4, plus a whole-row reference for ROW_MARK_COPY. If we know what
values we're going to store in columns 1..4, couldn't we just form
them into a tuple to populate column 5? We don't actually need to be
able to fetch such a tuple from the remote side because we can just
construct it. I think.

Is this a dumb idea, or could it work?
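
A minimal sketch of that idea in plain C, outside the executor: once the
four data columns are known, the fifth (whole-row) column is built from
them locally instead of being fetched. JoinSlot, WholeRow, and
fill_join_slot are hypothetical names used only for illustration; real
code would form a composite datum from the slot's values instead.

```c
#include <assert.h>
#include <string.h>

#define NUM_DATA_COLS 4

/* Hypothetical composite value standing in for a whole-row datum. */
typedef struct WholeRow
{
    int cols[NUM_DATA_COLS];
} WholeRow;

/* Hypothetical 5-column slot: ft1.a, ft1.b, ft2.a, ft2.b, whole row. */
typedef struct JoinSlot
{
    int      cols[NUM_DATA_COLS];
    WholeRow wholerow;
} JoinSlot;

/* Store columns 1..4, then derive column 5 from them. */
void
fill_join_slot(JoinSlot *slot, const int *values)
{
    assert(slot != NULL && values != NULL);

    memcpy(slot->cols, values, sizeof(slot->cols));

    /* Column 5 is constructed locally; nothing is refetched. */
    memcpy(slot->wholerow.cols, slot->cols, sizeof(slot->wholerow.cols));
}
```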

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#101

Kouhei Kaigai

kaigai@ak.jp.nec.com

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#98)

Re: Foreign join pushdown vs EvalPlanQual

Who can provide a projection to generate joined tuple?
It is a job of individual plan-state-node to be walked on during
EvalPlanQualNext().

EvalPlanQualNext simply does recheck tuples stored in epqTuples,
which are designed to be provided by EvalPlanQualFetchRowMarks.

I think that that premise shouldn't be broken for convenience...

Do I see something different or understand incorrectly?
EvalPlanQualNext() walks down entire subtree of the Lock node.
(epqstate->planstate is entire subplan of Lock node.)

TupleTableSlot *
EvalPlanQualNext(EPQState *epqstate)
{
    MemoryContext oldcontext;
    TupleTableSlot *slot;

    oldcontext = MemoryContextSwitchTo(epqstate->estate->es_query_cxt);
    slot = ExecProcNode(epqstate->planstate);
    MemoryContextSwitchTo(oldcontext);

    return slot;
}

If the relation joins are kept in the sub-plan, ExecProcNode()
processes the projection by the join, doesn't it?

Yes, but an alternative plan needs to be prepared to do such a
projection.

No problem. The custom-scan node is a good reference: it can have
underlying plan nodes that are kicked off by the extension.
If we adopt the same semantics, these alternative plans will not be
kicked off unless the FDW driver wants them to be.

Also, I don't think it is difficult to construct an alternative join
path using an unparameterized nested loop (note that all we need to do
is evaluate at most one tuple for each base relation).

If people feel this is really reinventing the wheel, the core backend can
provide a utility function to produce the alternative path.

Probably,

Path *
foreign_join_alternative_local_join_path(PlannerInfo *root,
                                         RelOptInfo *joinrel)

can generate what we need.

Why is projection by the join not a part of EvalPlanQualNext()?
It is the core of its job. Without the projection by the join, the upper
node cannot recheck the tuple coming from the child nodes.

What I'm uneasy about is that the foreign join introduces a difference
in behavior between ordinary fetching and EPQ fetching. Having a
joined whole row is quite natural, but it seems hard to get.

Hard to get, but easy to construct on the fly.

Another reason is that ExecScanFetch with scanrelid == 0 on EPQ
is FDW/CS specific feature and looks to be a kind of hack. (Even
if it would be one of many)

That means these are exceptional cases, so it is reasonable
to avoid fundamental changes to the row-lock mechanism, isn't it?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#102

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#100)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/07 6:19, Robert Haas wrote:

On Fri, Oct 2, 2015 at 4:26 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

During join search, a joinrel should be compatible between local
joins and remote joins, and of course so should its target list.
So it is quite difficult to add a whole-row resjunk for joinrels
before the whole join tree is completed, even if we allow row marks
that are not bound to base RTEs.

Suppose ROW_MARK_COPY is in use, and suppose the query is: SELECT
ft1.a, ft1.b, ft2.a, ft2.b FROM ft1, ft2 WHERE ft1.x = ft2.x;

When the foreign join is executed, there's going to be a slot that
needs to be populated with ft1.a, ft1.b, ft2.a, ft2.b, and a whole row
reference. Now, let's suppose the slot descriptor has 5 columns: those
4, plus a whole-row reference for ROW_MARK_COPY.

IIUC, I think that if ROW_MARK_COPY is in use, the descriptor would have
6 columns: those 4, plus a whole-row var for ft1 and another whole-row
var for ft2. Maybe I'm missing something, though.

4, plus a whole-row reference for ROW_MARK_COPY. If we know what
values we're going to store in columns 1..4, couldn't we just form
them into a tuple to populate column 5? We don't actually need to be
able to fetch such a tuple from the remote side because we can just
construct it. I think.

I also was thinking whether we could replace one of the whole-row vars
with a whole-row var that represents the scan slot of the
ForeignScanState node.

Best regards,
Etsuro Fujita

#103

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#102)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

At Wed, 7 Oct 2015 12:30:27 +0900, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote in <561491D3.3070901@lab.ntt.co.jp>

On 2015/10/07 6:19, Robert Haas wrote:

On Fri, Oct 2, 2015 at 4:26 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

During join search, a joinrel should be compatible between local
joins and remote joins, and of course so should its target list.
So it is quite difficult to add a whole-row resjunk for joinrels
before the whole join tree is completed, even if we allow row marks
that are not bound to base RTEs.

Suppose ROW_MARK_COPY is in use, and suppose the query is: SELECT
ft1.a, ft1.b, ft2.a, ft2.b FROM ft1, ft2 WHERE ft1.x = ft2.x;

When the foreign join is executed, there's going to be a slot that
needs to be populated with ft1.a, ft1.b, ft2.a, ft2.b, and a whole row
reference. Now, let's suppose the slot descriptor has 5 columns: those
4, plus a whole-row reference for ROW_MARK_COPY.

IIUC, I think that if ROW_MARK_COPY is in use, the descriptor would
have 6 columns: those 4, plus a whole-row var for ft1 and another
whole-row var for ft2. Maybe I'm missing something, though.

You're right. The result tuple for Robert's example has 6
attributes in the order of [ft1.a, ft1.b, (ft1.a, ft1.b), ft2.a...]

But the point of the discussion lies elsewhere. The problem
arises when such a join is joined with another local table. In that
case, the whole-row reference for the intermediate foreign join
would lose its targets in the top-level tuple.

4, plus a whole-row reference for ROW_MARK_COPY. If we know what
values we're going to store in columns 1..4, couldn't we just form
them into a tuple to populate column 5? We don't actually need to be
able to fetch such a tuple from the remote side because we can just
construct it. I think.

I also was thinking whether we could replace one of the whole-row vars
with a whole-row var that represents the scan slot of the
ForeignScanState node.

I suppose it requires additional resjunk to be added on joinrel
creation, which is what Kaigai-san called overkill. But I'm
interested in what it would look like.

cheers,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#104

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#102)

Re: Foreign join pushdown vs EvalPlanQual

On Tue, Oct 6, 2015 at 11:30 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

IIUC, I think that if ROW_MARK_COPY is in use, the descriptor would have 6
columns: those 4, plus a whole-row var for ft1 and another whole-row var for
ft2. Maybe I'm missing something, though.

Currently, yes, but I think we should change it so that...

4, plus a whole-row reference for ROW_MARK_COPY. If we know what
values we're going to store in columns 1..4, couldn't we just form
them into a tuple to populate column 5? We don't actually need to be
able to fetch such a tuple from the remote side because we can just
construct it. I think.

I also was thinking whether we could replace one of the whole-row vars with
a whole-row var that represents the scan slot of the ForeignScanState node.

...it works like this instead.

KaiGai is suggesting that constructing a join plan to live under the
foreign scan-qua-join isn't really that difficult, but that is like
saying that it's OK to go out to a nice sushi restaurant without
bringing any money because it won't be too hard to talk the manager
into letting you leave for a quick trip to the ATM at the end of the
meal. Maybe so, maybe not, but if you plan ahead and bring money then
you don't have to worry about it. The only reason why we would need
the join plan in the first place is because we had the foreign scan
output whole-row vars for the baserels instead of its own slot. If we
have it output a var for its own slot then it doesn't matter whether
constructing the join plan is easy or hard, because we don't need it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#105

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#103)

Re: Foreign join pushdown vs EvalPlanQual

On Wed, Oct 7, 2015 at 12:10 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

IIUC, I think that if ROW_MARK_COPY is in use, the descriptor would
have 6 columns: those 4, plus a whole-row var for ft1 and another
whole-row var for ft2. Maybe I'm missing something, though.

You're right. The result tuple for Robert's example has 6
attributes in the order of [ft1.a, ft1.b, (ft1.a, ft1.b), ft2.a...]

But the point of the discussion lies elsewhere. The problem
arises when such a join is joined with another local table. In that
case, the whole-row reference for the intermediate foreign join
would lose its targets in the top-level tuple.

Really? Would that mean that ROW_MARK_COPY is totally broken? I bet it's not.

4, plus a whole-row reference for ROW_MARK_COPY. If we know what
values we're going to store in columns 1..4, couldn't we just form
them into a tuple to populate column 5? We don't actually need to be
able to fetch such a tuple from the remote side because we can just
construct it. I think.

I also was thinking whether we could replace one of the whole-row vars
with a whole-row var that represents the scan slot of the
ForeignScanState node.

I suppose it requires additional resjunk to be added on joinrel
creation, which is what Kaigai-san called overkill. But I'm
interested in what it would look like.

I think it rather requires *replacing* two resjunk columns by one new
one. The whole-row references for the individual foreign tables are
only there to support EvalPlanQual; if we instead have a column to
populate the foreign scan's slot directly, then we can use that column
for that purpose directly and there's no remaining use for the
whole-row vars on the baserels.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#106

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#105)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

At Wed, 7 Oct 2015 00:24:57 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZRqXdtPh-RPbX-fRSdq+_c8U6dXcTovu+zgY0hrnR6TQ@mail.gmail.com>

On Wed, Oct 7, 2015 at 12:10 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

IIUC, I think that if ROW_MARK_COPY is in use, the descriptor would
have 6 columns: those 4, plus a whole-row var for ft1 and another
whole-row var for ft2. Maybe I'm missing something, though.

You're right. The result tuple for Robert's example has 6
attributes in the order of [ft1.a, ft1.b, (ft1.a, ft1.b), ft2.a...]

But the point of the discussion lies elsewhere. The problem
arises when such a join is joined with another local table. In that
case, the whole-row reference for the intermediate foreign join
would lose its targets in the top-level tuple.

Really? Would that mean that ROW_MARK_COPY is totally broken? I bet it's not.

The semantics of ROW_MARK_COPY is that the tuple should hold the
whole-row *value* in a resjunk column. I must have misunderstood
"whole row *reference*" by confusing planner and executor behaviors. I
understood the new story as adding to a tuple a reference to
itself. If that is wrong and the correct story is having an additional
whole-row *value* in the whole joined tuple, including the resjunks
passed from the underlying tuples, it should work.

4, plus a whole-row reference for ROW_MARK_COPY. If we know what
values we're going to store in columns 1..4, couldn't we just form
them into a tuple to populate column 5? We don't actually need to be
able to fetch such a tuple from the remote side because we can just
construct it. I think.

I also was thinking whether we could replace one of the whole-row vars
with a whole-row var that represents the scan slot of the
ForeignScanState node.

I suppose it requires additional resjunk to be added on joinrel
creation, which is what Kaigai-san called overkill. But I'm
interested in what it would look like.

I think it rather requires *replacing* two resjunk columns by one new
one. The whole-row references for the individual foreign tables are
only there to support EvalPlanQual; if we instead have a column to
populate the foreign scan's slot directly, then we can use that column
for that purpose directly and there's no remaining use for the
whole-row vars on the baserels.

That is what I had in mind. As it stands, target lists for joinrels
cannot be built in a straightforward way.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#107

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#106)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/07 15:06, Kyotaro HORIGUCHI wrote:

At Wed, 7 Oct 2015 00:24:57 -0400, Robert Haas <robertmhaas@gmail.com> wrote

I think it rather requires *replacing* two resjunk columns by one new
one. The whole-row references for the individual foreign tables are
only there to support EvalPlanQual; if we instead have a column to
populate the foreign scan's slot directly, then we can use that column
for that purpose directly and there's no remaining use for the
whole-row vars on the baserels.

It is what I had in mind.

OK I'll investigate this further.

Best regards,
Etsuro Fujita

#108

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#107)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/07 15:39, Etsuro Fujita wrote:

On 2015/10/07 15:06, Kyotaro HORIGUCHI wrote:

At Wed, 7 Oct 2015 00:24:57 -0400, Robert Haas <robertmhaas@gmail.com>
wrote

I think it rather requires *replacing* two resjunk columns by one new
one. The whole-row references for the individual foreign tables are
only there to support EvalPlanQual; if we instead have a column to
populate the foreign scan's slot directly, then we can use that column
for that purpose directly and there's no remaining use for the
whole-row vars on the baserels.

It is what I had in mind.

OK I'll investigate this further.

I noticed that the approach using a column to populate the foreign
scan's slot directly wouldn't work well in some cases. For example,
consider:

SELECT * FROM verysmall v LEFT JOIN (bigft1 JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r FOR UPDATE OF v;

The best plan is presumably something like this as you said before:

LockRows
  -> Nested Loop
       -> Seq Scan on verysmall v
       -> Foreign Scan on bigft1 and bigft2
            Remote SQL: SELECT * FROM bigft1 JOIN bigft2 ON bigft1.x =
            bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Consider the EvalPlanQual testing to see if the updated version of a
tuple in v satisfies the query. If we use the column in the testing, we
would get the wrong results in some cases.

Best regards,
Etsuro Fujita

#109

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#108)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/08 19:55, Etsuro Fujita wrote:

I noticed that the approach using a column to populate the foreign
scan's slot directly wouldn't work well in some cases. For example,
consider:

SELECT * FROM verysmall v LEFT JOIN (bigft1 JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r FOR UPDATE OF v;

Oops, I should have written "JOIN", not "LEFT JOIN".

The best plan is presumably something like this as you said before:

LockRows
  -> Nested Loop
       -> Seq Scan on verysmall v
       -> Foreign Scan on bigft1 and bigft2
            Remote SQL: SELECT * FROM bigft1 JOIN bigft2 ON bigft1.x =
            bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Consider the EvalPlanQual testing to see if the updated version of a
tuple in v satisfies the query. If we use the column in the testing, we
would get the wrong results in some cases.

More precisely, we would get the wrong result when the value of v.q or
v.r in the updated version has changed.
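
A minimal sketch of that failure mode in plain C. It assumes the joined
row was fetched remotely with the old values of v.q and v.r bound to $1
and $2, and compares a recheck that trusts the cached row with one that
re-evaluates the parameterized quals against the updated tuple of v. All
type and function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuple of the local table "verysmall v". */
typedef struct LocalRow
{
    int q;
    int r;
} LocalRow;

/* Hypothetical joined row cached from the remote join: bigft1.q, bigft2.r. */
typedef struct ForeignJoinRow
{
    int q;
    int r;
} ForeignJoinRow;

/* Naive recheck: trusts the cached joined row and never looks at v. */
bool
recheck_using_cached_row(const ForeignJoinRow *cached)
{
    return cached != NULL;
}

/* Recheck that re-evaluates v.q = bigft1.q AND v.r = bigft2.r. */
bool
recheck_with_quals(const LocalRow *v, const ForeignJoinRow *cached)
{
    assert(v != NULL);

    if (cached == NULL)
        return false;
    return v->q == cached->q && v->r == cached->r;
}
```

With a cached row of (q=1, r=2) and an updated v of (q=5, r=2), the naive
recheck still accepts the row while the qual-based recheck rejects it.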

I don't have a good idea for this, so would an approach using a local
join execution plan be a good way to go?

Best regards,
Etsuro Fujita

#110

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 10 years ago

In reply to: Etsuro Fujita (#109)

Re: Foreign join pushdown vs EvalPlanQual

Hi,

At Fri, 9 Oct 2015 12:00:30 +0900, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote in <56172DCE.7080604@lab.ntt.co.jp>

On 2015/10/08 19:55, Etsuro Fujita wrote:

I noticed that the approach using a column to populate the foreign
scan's slot directly wouldn't work well in some cases. For example,
consider:

SELECT * FROM verysmall v LEFT JOIN (bigft1 JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r FOR UPDATE OF v;

Oops, I should have written "JOIN", not "LEFT JOIN".

The best plan is presumably something like this as you said before:

LockRows
  -> Nested Loop
       -> Seq Scan on verysmall v
       -> Foreign Scan on bigft1 and bigft2
            Remote SQL: SELECT * FROM bigft1 JOIN bigft2 ON bigft1.x =
            bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Consider the EvalPlanQual testing to see if the updated version of a
tuple in v satisfies the query. If we use the column in the testing, we
would get the wrong results in some cases.

More precisely, we would get the wrong result when the value of v.q or
v.r in the updated version has changed.

What do you think the right behavior?

Assume that it is simply a join between local tables:

SELECT * FROM t1 JOIN t2 on (t1.a = t2.a) FOR UPDATE;

IIUC, if t1.a gets updated and EPQ runs, the tuple for t1 is
refetched by ctid and the one for t2 is reused, so the pair would
fail the join qual and the joined tuple won't be returned.

What happens in the foreign join example seems to be exactly the
same thing.

I don't have a good idea for this, so would an approach using a local
join execution plan be a good way to go?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#111

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Kyotaro HORIGUCHI (#110)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/09 15:04, Kyotaro HORIGUCHI wrote:

At Fri, 9 Oct 2015 12:00:30 +0900, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote in <56172DCE.7080604@lab.ntt.co.jp>

On 2015/10/08 19:55, Etsuro Fujita wrote:

I noticed that the approach using a column to populate the foreign
scan's slot directly wouldn't work well in some cases. For example,
consider:

SELECT * FROM verysmall v LEFT JOIN (bigft1 JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r FOR UPDATE OF v;

Oops, I should have written "JOIN", not "LEFT JOIN".

The best plan is presumably something like this as you said before:

LockRows
  -> Nested Loop
       -> Seq Scan on verysmall v
       -> Foreign Scan on bigft1 and bigft2
            Remote SQL: SELECT * FROM bigft1 JOIN bigft2 ON bigft1.x =
            bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Consider the EvalPlanQual testing to see if the updated version of a
tuple in v satisfies the query. If we use the column in the testing, we
would get the wrong results in some cases.

More precisely, we would get the wrong result when the value of v.q or
v.r in the updated version has changed.

What do you think the right behavior?

IIUC, I think that the foreign scan's slot should be set empty, that the
join should fail, and that the updated version of the tuple in v should
be ignored in that scenario, since for the updated version of the
tuple in v, the tuples obtained from those two foreign tables wouldn't
satisfy the remote query. But if we populate the foreign scan's slot
from that column, then the join would succeed and the updated version of
the tuple in v would be returned wrongly, I think.
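
To make the two outcomes concrete, here is a toy model in C of that EPQ
test. It is only a sketch under assumptions: OuterTuple, JoinedTuple and
the epq_fetch_* functions are made-up names for illustration, not executor
or FDW APIs, and the model covers only the q and r comparisons that were
pushed down as $1 and $2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for tuples; these types do not exist in PostgreSQL. */
typedef struct { int q; int r; } OuterTuple;                 /* a row of verysmall v */
typedef struct { int ft1_q; int ft2_r; int x; } JoinedTuple; /* a bigft1 JOIN bigft2 row */

/* The conditions pushed down as "bigft1.q = $1 AND bigft2.r = $2",
 * evaluated locally against the current version of the outer row. */
bool pushed_down_quals_hold(const OuterTuple *v, const JoinedTuple *j)
{
    return j->ft1_q == v->q && j->ft2_r == v->r;
}

/* EPQ test that reuses the saved joined row without looking at v. */
const JoinedTuple *epq_fetch_no_recheck(const OuterTuple *updated_v,
                                        const JoinedTuple *saved)
{
    (void) updated_v;           /* the updated outer row is never consulted */
    assert(saved != NULL);
    return saved;
}

/* EPQ test that rechecks; NULL models an empty foreign scan slot. */
const JoinedTuple *epq_fetch_with_recheck(const OuterTuple *updated_v,
                                          const JoinedTuple *saved)
{
    assert(updated_v != NULL && saved != NULL);
    return pushed_down_quals_hold(updated_v, saved) ? saved : NULL;
}
```

With v updated from (q, r) = (1, 1) to (2, 1) and a saved joined row that
was fetched for the old values, epq_fetch_no_recheck() still hands back the
saved row, while epq_fetch_with_recheck() yields NULL, which corresponds to
leaving the foreign scan's slot empty so that the join fails.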

Best regards,
Etsuro Fujita


#112

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#60)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On 2015/09/11 6:30, Robert Haas wrote:

On Wed, Sep 9, 2015 at 2:30 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

But that path might have already been discarded on the basis of cost.
I think Tom's idea is better: let the FDW consult some state cached
for this purpose in the RelOptInfo

Do you have an idea of what information would be collected into the state
and how the FDW would derive from that information the parameterizations to
consider when producing pushed-down joins? My concern is to reduce the number
of parameterizations to consider, and so the overhead of costing the
corresponding queries. I may be missing something, though.

I think the thing we'd want to store in the state would be enough
information to reconstruct a valid join nest. For example, the
reloptinfo for (A B) might note that A needs to be left-joined to B.
When we go to construct paths for (A B C), and there is no
SpecialJoinInfo that mentions C, we know that we can construct (A LJ
B) IJ C rather than (A IJ B) IJ C. If any paths survived, we could
find a way to pull that information out of the path, but pulling it
out of the RelOptInfo should always work.

I am not sure what to do about parameterizations. That's one of my
remaining concerns about moving the hook.

Do you have any plans for the hook?

Best regards,
Etsuro Fujita


#113

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#68)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/09/12 1:38, Robert Haas wrote:

On Thu, Sep 10, 2015 at 11:36 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I've proposed the following API changes:

* I modified create_foreignscan_path, which is called from
postgresGetForeignJoinPaths/postgresGetForeignPaths, so that a path,
subpath, is passed as the eighth argument of the function. subpath
represents a local join execution path if scanrelid==0, but NULL if
scanrelid>0.

OK, I see now. But I don't much like the way
get_unsorted_unparameterized_path() looks.

First, it's basically praying that MergePath, HashPath, and NestPath
can be flat-copied without breaking anything. In general, we have
copyfuncs.c support for nodes that we need to be able to copy, and we
use copyObject() to do it. Even if what you've got here works today,
it's not very future-proof.

Agreed.

Second, what guarantee do we have that we'll find a path with no
pathkeys and a NULL param_info? Why can't all of the paths for a join
relation have pathkeys? Why can't they all be parameterized? I can't
think of anything that would guarantee that.

No, there is no such guarantee. I modified the patch that way simply because
the latest postgres_fdw patch doesn't support creating a remote query
for a presorted or parameterized path for a remote join.

Third, even if such a guarantee existed, why is this the right
behavior? Any join type will produce the same output; it's just a
question of performance. And if you have only one tuple on each side,
surely a nested loop would be fine.

Yeah, I think we would also need to consider the parameterization.

It seems to me that what you ought to be doing is using data hung off
the fdw_private field of each RelOptInfo to cache a NestPath that can
be used for EPQ rechecks at that level. When you go to consider
pushing down another join, you can build up a new NestPath that's
suitable for the new level. That seems much cleaner than groveling
through the list of surviving paths and hoping you find the right kind
of thing.

Agreed.

(To be clear, I have never been against having an FDW author create the
local join execution path itself. I modified the patch to find a local
join execution path in the path list simply because that is simple. The
main point I'd like to discuss about the patch is the changes to the core
code.)

And all that having been said, I still don't really understand why you
are resisting the idea of providing a callback so that the FDW can
execute arbitrary code in the recheck path. There doesn't seem to be
any reason not to let the FDW take control of the rechecks if it
wishes, and there's no real cost in complexity that I can see.

I thought it would put a considerable development burden on FDW
authors, so I was rather against the idea of providing such a callback.

I know we still haven't reached a consensus on whether to address this
issue by using a local join execution path.

Best regards,
Etsuro Fujita


#114

Jeevan Chalke

jeevan.chalke@enterprisedb.com

over 10 years ago

In reply to: Etsuro Fujita (#113)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Oct 9, 2015 at 3:35 PM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>
wrote:

Hi,

Just to get hands-on, I started looking into this issue and trying to
grasp it, as this is totally new code for me. Later I want to review
these code changes.

I noticed that this thread started by saying we get a crash
with the given steps with foreign_join_v16.patch. Am I correct?

Then there are various patches that try to fix this,
fdw-eval-plan-qual-*.patch

I tried applying foreign_join_v16.patch, which applied fine, and tried
reproducing the crash. But instead of a crash I get the following error.

ERROR: could not serialize access due to concurrent update
CONTEXT: Remote SQL command: SELECT a FROM public.foo FOR UPDATE
Remote SQL command: SELECT a FROM public.tab FOR UPDATE

Then I applied fdw-eval-plan-qual-3.0.patch on top of it. It did not
apply cleanly (maybe due to other changes in the meantime).
I fixed the conflicts and the warnings to make it compile.

When I run the same SQL sequence, I get a crash in terminal 2 at the EXPLAIN
itself.

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

I am using the following SQL statements:

create table tab (a int, b int);
create foreign table foo (a int) server myserver options(table_name 'tab');
create foreign table bar (a int) server myserver options(table_name 'tab');

insert into tab values (1, 1);
insert into foo values (1);
insert into bar values (1);

analyze tab;
analyze foo;
analyze bar;

Run the example:

[Terminal 1]
begin;
update tab set b = b + 1 where a = 1;

[Terminal 2]
explain verbose select tab.* from tab, foo, bar where tab.a =
foo.a and foo.a = bar.a for update;

Am I missing something here?
Do I need to apply any other patch from other mail-threads?

Do you have a sample test case explaining the issue and the fix?

With these simple questions, I might have taken the thread slightly away
from the design considerations; please excuse me for that.

Thanks

--
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

#115

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#112)

Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)

On Fri, Oct 9, 2015 at 5:41 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

Do you have any plans for the hook?

No. I think if you and Tom think it should be moved, one of you
should propose a patch. Ideally accompanied by a demo of how
postgres_fdw would be expected to use the revised hook.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#116

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Etsuro Fujita (#109)

Re: Foreign join pushdown vs EvalPlanQual

On Thu, Oct 8, 2015 at 11:00 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

The best plan is presumably something like this as you said before:

LockRows
  -> Nested Loop
       -> Seq Scan on verysmall v
       -> Foreign Scan on bigft1 and bigft2
            Remote SQL: SELECT * FROM bigft1 JOIN bigft2 ON bigft1.x =
            bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Consider the EvalPlanQual testing to see if the updated version of a
tuple in v satisfies the query. If we use the column in the testing, we
would get the wrong results in some cases.

More precisely, we would get the wrong result when the value of v.q or v.r
in the updated version has changed.

Interesting test case. It's worth considering why this works if you
were to replace the Foreign Scan with an Index Scan; suppose the query
is SELECT * FROM verysmall v LEFT JOIN realbiglocaltable t ON v.x =
t.x FOR UPDATE OF v, so that you get:

LockRows
  -> Nested Loop
       -> Seq Scan on verysmall v
       -> Index Scan on realbiglocaltable t
            Index Cond: v.x = t.x

In your example, the remote SQL pushes down certain quals to the
remote server, and so if we just return the same tuple they might no
longer be satisfied. In this example, the index qual is essentially a
filter condition that has been "pushed down" into the index AM. The
EvalPlanQual machinery prevents this from generating wrong answers by
rechecking the index cond - see IndexRecheck. Even though it's
normally the AM's job to enforce the index cond, and the executor does
not need to recheck, in the EvalPlanQual case it does need to recheck.

I think the foreign data wrapper case should be handled the same way.
Any condition that we initially pushed down to the foreign server
needs to be locally rechecked if we're inside EPQ.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#117

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 10 years ago

In reply to: Jeevan Chalke (#114)

Re: Foreign join pushdown vs EvalPlanQual

Hi,

At Fri, 9 Oct 2015 18:18:52 +0530, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote in <CAM2+6=XsXhMw_owFiKJP9syUx9eFc0x5U9jGOtO9v34G5epd8g@mail.gmail.com>

I noticed that this thread started by saying we get a crash
with the given steps with foreign_join_v16.patch. Am I correct?

You're correct. The immediate cause of the crash is an assertion
failure: EvalPlanQualNext doesn't find a tuple to examine for
a "foreign join" that was turned into a ForeignScan as the result of
foreign join pushdown.

Then there are various patches that try to fix this,
fdw-eval-plan-qual-*.patch

I tried applying foreign_join_v16.patch, which applied fine, and tried
reproducing the crash. But instead of a crash I get the following error.

ERROR: could not serialize access due to concurrent update
CONTEXT: Remote SQL command: SELECT a FROM public.foo FOR UPDATE
Remote SQL command: SELECT a FROM public.tab FOR UPDATE

That is because your setup differs from the original steps.

postgres_fdw runs its remote transaction at REPEATABLE READ isolation
level or higher. You updated a row locally while postgres_fdw was locking
the same tuple in a REPEATABLE READ transaction. The foreign tables should
be mapped to a table other than the locally modified one.

- create table tab (a int, b int);
- create foreign table foo (a int) server myserver options(table_name 'tab');
- create foreign table bar (a int) server myserver options(table_name 'tab');
+ create table tab (a int, b int);
+ create table lfb (a int, b int);
+ create foreign table foo (a int) server myserver options(table_name 'lfb');
+ create foreign table bar (a int) server myserver options(table_name 'lfb');

And you'll get the following assertion failure.

| TRAP: FailedAssertion("!(scanrelid > 0)", File: "execScan.c", Line: 52)
| LOG: unexpected EOF on client connection with an open transaction
| LOG: server process (PID 16738) was terminated by signal 6: Aborted
| DETAIL: Failed process was running: explain (verbose, analyze) select t1.* from t1, ft2, ft2_2 where t1.a = ft2.a and ft2.a = ft2_2.a for update;
| LOG: terminating any other active server proces

Then I applied fdw-eval-plan-qual-3.0.patch on top of it. It did not
apply cleanly (maybe due to other changes in the meantime).
I fixed the conflicts and the warnings to make it compile.

The combination won't work because the patch requires
postgres_fdw to pass an alternative path as the subpath to
create_foreignscan_path. AFAICS no corresponding foreign-join
patch has been posted in this thread, which is still discussing
the desirable join pushdown API for FDWs.

When I run the same SQL sequence, I get a crash in terminal 2 at the EXPLAIN
itself.

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

I am using the following SQL statements:

create table tab (a int, b int);
create foreign table foo (a int) server myserver options(table_name 'tab');
create foreign table bar (a int) server myserver options(table_name 'tab');

insert into tab values (1, 1);
insert into foo values (1);
insert into bar values (1);

analyze tab;
analyze foo;
analyze bar;

Run the example:

[Terminal 1]
begin;
update tab set b = b + 1 where a = 1;

[Terminal 2]
explain verbose select tab.* from tab, foo, bar where tab.a =
foo.a and foo.a = bar.a for update;

Am I missing something here?
Do I need to apply any other patch from other mail-threads?

Do you have a sample test case explaining the issue and the fix?

With these simple questions, I might have taken the thread slightly away
from the design considerations; please excuse me for that.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#118

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Etsuro Fujita (#108)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Etsuro Fujita
Sent: Thursday, October 08, 2015 7:56 PM
To: Kyotaro HORIGUCHI; robertmhaas@gmail.com
Cc: Kaigai Kouhei(海外浩平); pgsql-hackers@postgresql.org;
shigeru.hanada@gmail.com
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/10/07 15:39, Etsuro Fujita wrote:

On 2015/10/07 15:06, Kyotaro HORIGUCHI wrote:

At Wed, 7 Oct 2015 00:24:57 -0400, Robert Haas <robertmhaas@gmail.com>
wrote

I think it rather requires *replacing* two resjunk columns by one new
one. The whole-row references for the individual foreign tables are
only there to support EvalPlanQual; if we instead have a column to
populate the foreign scan's slot directly, then we can use that column
for that purpose directly and there's no remaining use for the
whole-row vars on the baserels.

It is what I had in mind.

OK I'll investigate this further.

I noticed that the approach using a column to populate the foreign
scan's slot directly wouldn't work well in some cases. For example,
consider:

SELECT * FROM verysmall v LEFT JOIN (bigft1 JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r FOR UPDATE OF v;

The best plan is presumably something like this as you said before:

LockRows
  -> Nested Loop
       -> Seq Scan on verysmall v
       -> Foreign Scan on bigft1 and bigft2
            Remote SQL: SELECT * FROM bigft1 JOIN bigft2 ON bigft1.x =
            bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Consider the EvalPlanQual testing to see if the updated version of a
tuple in v satisfies the query. If we use the column in the testing, we
would get the wrong results in some cases.

In this case, does the ForeignScan have to be reset prior to ExecProcNode()?
Once ExecReScanForeignScan() gets called by ExecNestLoop(), it marks the EPQ
slot as invalid. So, more or less, the ForeignScan needs to run the remote
join again based on the new parameters that come from the latest verysmall
tuple. Please correct me if I don't understand correctly.
In the unparameterized ForeignScan case, the cached join tuple works
well because it is independent of verysmall.

Once again, if the FDW driver is responsible for constructing the join tuple
from the base relations' tuples cached in the EPQ slots, this case doesn't
need to run the remote query again, because all the materials to construct
the join tuple are already held locally. Right?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#119

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 10 years ago

In reply to: Etsuro Fujita (#111)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

At Fri, 9 Oct 2015 18:40:32 +0900, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote in <56178B90.4030206@lab.ntt.co.jp>

What do you think the right behavior?

# 'is' was omitted..

IIUC, I think that the foreign scan's slot should be set empty, that

Even in that case, EvalPlanQualFetchRowMarks retrieves the tuples of the
remote tables out of the whole-row resjunk columns and sets them in
es_epqTuple[] so that EvalPlanQualNext can use them. The behavior
is no different from the 'FOR UPDATE;' (for all tables) case.

I supposed that the whole-row value for the joined tuple would be
treated in the same manner as the tuples of base
foreign relations.

This is because preprocess_rowmarks makes rowMarks of the type
LCS_NONE for the relations other than those designated by "OF
colref" for "FOR UPDATE". That is then converted into ROW_MARK_COPY
by select_rowmark_type, which causes the behavior above, as the
default behavior for foreign tables.

the join should fail, and that the updated version of the tuple in v
should be ignored in that scenario, since for the updated version
of the tuple in v, the tuples obtained from those two foreign tables
wouldn't satisfy the remote query.

AFAICS, no updated versions of the remote tuples are obtained.

Even with the behavior I described above, the join
would fail, too. But that is because v.r is no longer equal to
bigft2.r in the whole-row-var tuples. This seems to be the
same behavior as that on local tables.

# LCS_NONE for local tables is converted into ROW_MARK_COPY if no
# securityQuals are attached.

But if we populate the foreign
scan's slot from that column, then the join would succeed and the
updated version of the tuple in v would be returned wrongly, I think.

I might be misunderstanding some points.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#120

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 10 years ago

In reply to: Kouhei Kaigai (#118)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

At Wed, 14 Oct 2015 03:07:31 +0000, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote in <9A28C8860F777E439AA12E8AEA7694F801157077@BPXM15GP.gisp.nec.co.jp>

I noticed that the approach using a column to populate the foreign
scan's slot directly wouldn't work well in some cases. For example,
consider:

SELECT * FROM verysmall v LEFT JOIN (bigft1 JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r FOR UPDATE OF v;

The best plan is presumably something like this as you said before:

LockRows
  -> Nested Loop
       -> Seq Scan on verysmall v
       -> Foreign Scan on bigft1 and bigft2
            Remote SQL: SELECT * FROM bigft1 JOIN bigft2 ON bigft1.x =
            bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Consider the EvalPlanQual testing to see if the updated version of a
tuple in v satisfies the query. If we use the column in the testing, we
would get the wrong results in some cases.

I have a basic (or maybe silly) question. Is it true that the
join inner (the ForeignScan in the example) is re-executed with
the modified value of v.r? I observed for a join among only
local tables that previously fetched tuples for the inner are
simply reused regardless of join type. Even when a refetch
happens (I haven't confirmed it, but it would occur in the case of no
security quals), the tuple is located by ctid, so the re-join
between local and remote would fail. Is this wrong?

In this case, does the ForeignScan have to be reset prior to ExecProcNode()?
Once ExecReScanForeignScan() gets called by ExecNestLoop(), it marks the EPQ
slot as invalid. So, more or less, the ForeignScan needs to run the remote
join again based on the new parameters that come from the latest verysmall
tuple. Please correct me if I don't understand correctly.

So, no rescan would happen in these cases, I think. ReScan seems
to be kicked only for a new (next) outer tuple that causes a
change of parameters, but not for EPQ. I might have misunderstood
you.

In the unparameterized ForeignScan case, the cached join tuple works
well because it is independent of verysmall.

Once again, if the FDW driver is responsible for constructing the join tuple
from the base relations' tuples cached in the EPQ slots, this case doesn't
need to run the remote query again, because all the materials to construct
the join tuple are already held locally. Right?

It is definitely right and should be doable. But I think the
point we are arguing here is what the desirable behavior is.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#121

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Robert Haas (#116)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/10 10:17, Robert Haas wrote:

On Thu, Oct 8, 2015 at 11:00 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

The best plan is presumably something like this as you said before:

LockRows
  -> Nested Loop
       -> Seq Scan on verysmall v
       -> Foreign Scan on bigft1 and bigft2
            Remote SQL: SELECT * FROM bigft1 JOIN bigft2 ON bigft1.x =
            bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Consider the EvalPlanQual testing to see if the updated version of a
tuple in v satisfies the query. If we use the column in the testing, we
would get the wrong results in some cases.

More precisely, we would get the wrong result when the value of v.q or v.r
in the updated version has changed.

Interesting test case. It's worth considering why this works if you
were to replace the Foreign Scan with an Index Scan; suppose the query
is SELECT * FROM verysmall v LEFT JOIN realbiglocaltable t ON v.x =
t.x FOR UPDATE OF v, so that you get:

LockRows
  -> Nested Loop
       -> Seq Scan on verysmall v
       -> Index Scan on realbiglocaltable t
            Index Cond: v.x = t.x

In your example, the remote SQL pushes down certain quals to the
remote server, and so if we just return the same tuple they might no
longer be satisfied. In this example, the index qual is essentially a
filter condition that has been "pushed down" into the index AM. The
EvalPlanQual machinery prevents this from generating wrong answers by
rechecking the index cond - see IndexRecheck. Even though it's
normally the AM's job to enforce the index cond, and the executor does
not need to recheck, in the EvalPlanQual case it does need to recheck.

I think the foreign data wrapper case should be handled the same way.
Any condition that we initially pushed down to the foreign server
needs to be locally rechecked if we're inside EPQ.

Agreed.

As KaiGai-san also pointed out before, I think we should address this in
each of the following cases:

1) remote qual (scanrelid>0)
2) remote join (scanrelid==0)

As for #1, I noticed that there is a bug in handling the same kind of
FDW queries, which will be shown below. As you said, I think this
should be addressed by rechecking the remote quals *locally*. (I
had thought of another fix for this kind of bug before, though.) IIUC, I
think this should be fixed separately from #2, as this is a bug not only in
9.5 but also in the back branches. Please find attached a patch.

Create an environment:

mydatabase=# create table t1 (a int primary key, b text);
mydatabase=# insert into t1 select a, 'notsolongtext' from
generate_series(1, 1000000) a;

postgres=# create server myserver foreign data wrapper postgres_fdw
options (dbname 'mydatabase');
postgres=# create user mapping for current_user server myserver;
postgres=# create foreign table ft1 (a int, b text) server myserver
options (table_name 't1');
postgres=# alter foreign table ft1 options (add use_remote_estimate 'true');
postgres=# create table inttab (a int);
postgres=# insert into inttab select a from generate_series(1, 10) a;
postgres=# analyze ft1;
postgres=# analyze inttab;

Run concurrent transactions that produce an incorrect result:

[Terminal1]
postgres=# begin;
BEGIN
postgres=# update inttab set a = a + 1 where a = 1;
UPDATE 1

[Terminal2]
postgres=# explain verbose select * from inttab, ft1 where inttab.a =
ft1.a limit 1 for update;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Limit  (cost=100.43..198.99 rows=1 width=70)
   Output: inttab.a, ft1.a, ft1.b, inttab.ctid, ft1.*
   ->  LockRows  (cost=100.43..1086.00 rows=10 width=70)
         Output: inttab.a, ft1.a, ft1.b, inttab.ctid, ft1.*
         ->  Nested Loop  (cost=100.43..1085.90 rows=10 width=70)
               Output: inttab.a, ft1.a, ft1.b, inttab.ctid, ft1.*
               ->  Seq Scan on public.inttab  (cost=0.00..1.10 rows=10 width=10)
                     Output: inttab.a, inttab.ctid
               ->  Foreign Scan on public.ft1  (cost=100.43..108.47 rows=1 width=18)
                     Output: ft1.a, ft1.b, ft1.*
                     Remote SQL: SELECT a, b FROM public.t1 WHERE (($1::integer = a)) FOR UPDATE
(11 rows)

postgres=# select * from inttab, ft1 where inttab.a = ft1.a limit 1 for
update;

[Terminal1]
postgres=# commit;
COMMIT

[Terminal2]
(After the commit in Terminal1, the following result will be shown in
Terminal2. Note that the values of inttab.a and ft1.a wouldn't satisfy
the remote qual!)
 a | a |       b
---+---+---------------
 2 | 1 | notsolongtext
(1 row)
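
The wrong row above can be mimicked with a toy C model of the recheck
step. This is a sketch only: InttabRow, Ft1Row and the three functions are
hypothetical names, not the executor's ForeignRecheck or any FDW API. The
model just compares the updated inttab.a with the ft1.a that was fetched
before the concurrent update committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy rows; these types are illustrative and not part of PostgreSQL. */
typedef struct { int a; } InttabRow;              /* local row, may be updated */
typedef struct { int a; const char *b; } Ft1Row;  /* row fetched from the remote side */

/* Signature of a recheck step in this toy model. */
typedef bool (*RecheckFn)(const InttabRow *outer, const Ft1Row *fetched);

/* Models a recheck that trusts the remote side and checks nothing. */
bool recheck_always_true(const InttabRow *outer, const Ft1Row *fetched)
{
    (void) outer;
    (void) fetched;
    return true;
}

/* Re-evaluates the remote qual "$1 = a" locally, taking $1 from the
 * current (possibly updated) outer row. */
bool recheck_remote_qual(const InttabRow *outer, const Ft1Row *fetched)
{
    return outer->a == fetched->a;
}

/* Does the EPQ test emit the (inttab, ft1) pair under a given recheck? */
bool epq_emits_row(RecheckFn recheck, const InttabRow *updated,
                   const Ft1Row *fetched)
{
    assert(recheck != NULL && updated != NULL && fetched != NULL);
    return recheck(updated, fetched);
}
```

With inttab.a updated to 2 and the previously fetched ft1 row (1,
'notsolongtext'), recheck_always_true lets the pair through, which matches
the "2 | 1" row shown above, while recheck_remote_qual rejects it.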

As for #2, I didn't come up with any solution for locally rechecking
pushed-down join conditions against a joined tuple populated from the
column that we discussed. Instead, I'd like to revise the
local-join-execution-plan-based approach that we discussed before,
addressing your comments such as [1]. Would that be the right way to go?

Best regards,
Etsuro Fujita

[1]: /messages/by-id/CA+TgmoaAzs0dR23R7PTBseQfwOtuVCPNBqDHxeBo9Gi+dMxj8w@mail.gmail.com

Attachments:

foreign-recheck-for-foreign-table-1.patch (text/x-patch)

*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***************
*** 563,569 **** fileGetForeignPlan(PlannerInfo *root,
  							scan_relid,
  							NIL,	/* no expressions to evaluate */
  							best_path->fdw_private,
! 							NIL /* no custom tlist */ );
  }
  
  /*
--- 563,570 ----
  							scan_relid,
  							NIL,	/* no expressions to evaluate */
  							best_path->fdw_private,
! 							NIL,	/* no custom tlist */
! 							NIL /* no remote quals */ );
  }
  
  /*
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***************
*** 748,753 **** postgresGetForeignPlan(PlannerInfo *root,
--- 748,754 ----
  	Index		scan_relid = baserel->relid;
  	List	   *fdw_private;
  	List	   *remote_conds = NIL;
+ 	List	   *remote_exprs = NIL;
  	List	   *local_exprs = NIL;
  	List	   *params_list = NIL;
  	List	   *retrieved_attrs;
***************
*** 769,776 **** postgresGetForeignPlan(PlannerInfo *root,
  	 *
  	 * This code must match "extract_actual_clauses(scan_clauses, false)"
  	 * except for the additional decision about remote versus local execution.
! 	 * Note however that we only strip the RestrictInfo nodes from the
! 	 * local_exprs list, since appendWhereClause expects a list of
  	 * RestrictInfos.
  	 */
  	foreach(lc, scan_clauses)
--- 770,777 ----
  	 *
  	 * This code must match "extract_actual_clauses(scan_clauses, false)"
  	 * except for the additional decision about remote versus local execution.
! 	 * Note however that we don't strip the RestrictInfo nodes from the
! 	 * remote_conds list, since appendWhereClause expects a list of
  	 * RestrictInfos.
  	 */
  	foreach(lc, scan_clauses)
***************
*** 784,794 **** postgresGetForeignPlan(PlannerInfo *root,
--- 785,801 ----
  			continue;
  
  		if (list_member_ptr(fpinfo->remote_conds, rinfo))
+ 		{
  			remote_conds = lappend(remote_conds, rinfo);
+ 			remote_exprs = lappend(remote_exprs, rinfo->clause);
+ 		}
  		else if (list_member_ptr(fpinfo->local_conds, rinfo))
  			local_exprs = lappend(local_exprs, rinfo->clause);
  		else if (is_foreign_expr(root, baserel, rinfo->clause))
+ 		{
  			remote_conds = lappend(remote_conds, rinfo);
+ 			remote_exprs = lappend(remote_exprs, rinfo->clause);
+ 		}
  		else
  			local_exprs = lappend(local_exprs, rinfo->clause);
  	}
***************
*** 874,880 **** postgresGetForeignPlan(PlannerInfo *root,
  							scan_relid,
  							params_list,
  							fdw_private,
! 							NIL /* no custom tlist */ );
  }
  
  /*
--- 881,888 ----
  							scan_relid,
  							params_list,
  							fdw_private,
! 							NIL,	/* no custom tlist */
! 							remote_exprs);
  }
  
  /*
*** a/src/backend/executor/nodeForeignscan.c
--- b/src/backend/executor/nodeForeignscan.c
***************
*** 72,79 **** ForeignNext(ForeignScanState *node)
  static bool
  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  {
! 	/* There are no access-method-specific conditions to recheck. */
! 	return true;
  }
  
  /* ----------------------------------------------------------------
--- 72,90 ----
  static bool
  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  {
! 	ExprContext *econtext;
! 
! 	/*
! 	 * extract necessary information from foreign scan node
! 	 */
! 	econtext = node->ss.ps.ps_ExprContext;
! 
! 	/* Does the tuple meet the remote qual condition? */
! 	econtext->ecxt_scantuple = slot;
! 
! 	ResetExprContext(econtext);
! 
! 	return ExecQual(node->fdw_scan_quals, econtext, false);
  }
  
  /* ----------------------------------------------------------------
***************
*** 135,140 **** ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
--- 146,154 ----
  	scanstate->ss.ps.qual = (List *)
  		ExecInitExpr((Expr *) node->scan.plan.qual,
  					 (PlanState *) scanstate);
+ 	scanstate->fdw_scan_quals = (List *)
+ 		ExecInitExpr((Expr *) node->fdw_scan_quals,
+ 					 (PlanState *) scanstate);
  
  	/*
  	 * tuple table initialization
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 648,653 **** _copyForeignScan(const ForeignScan *from)
--- 648,654 ----
  	COPY_NODE_FIELD(fdw_exprs);
  	COPY_NODE_FIELD(fdw_private);
  	COPY_NODE_FIELD(fdw_scan_tlist);
+ 	COPY_NODE_FIELD(fdw_scan_quals);
  	COPY_BITMAPSET_FIELD(fs_relids);
  	COPY_SCALAR_FIELD(fsSystemCol);
  
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 594,599 **** _outForeignScan(StringInfo str, const ForeignScan *node)
--- 594,600 ----
  	WRITE_NODE_FIELD(fdw_exprs);
  	WRITE_NODE_FIELD(fdw_private);
  	WRITE_NODE_FIELD(fdw_scan_tlist);
+ 	WRITE_NODE_FIELD(fdw_scan_quals);
  	WRITE_BITMAPSET_FIELD(fs_relids);
  	WRITE_BOOL_FIELD(fsSystemCol);
  }
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 2153,2158 **** create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
--- 2153,2160 ----
  			replace_nestloop_params(root, (Node *) scan_plan->scan.plan.qual);
  		scan_plan->fdw_exprs = (List *)
  			replace_nestloop_params(root, (Node *) scan_plan->fdw_exprs);
+ 		scan_plan->fdw_scan_quals = (List *)
+ 			replace_nestloop_params(root, (Node *) scan_plan->fdw_scan_quals);
  	}
  
  	/*
***************
*** 3738,3744 **** make_foreignscan(List *qptlist,
  				 Index scanrelid,
  				 List *fdw_exprs,
  				 List *fdw_private,
! 				 List *fdw_scan_tlist)
  {
  	ForeignScan *node = makeNode(ForeignScan);
  	Plan	   *plan = &node->scan.plan;
--- 3740,3747 ----
  				 Index scanrelid,
  				 List *fdw_exprs,
  				 List *fdw_private,
! 				 List *fdw_scan_tlist,
! 				 List *fdw_scan_quals)
  {
  	ForeignScan *node = makeNode(ForeignScan);
  	Plan	   *plan = &node->scan.plan;
***************
*** 3754,3759 **** make_foreignscan(List *qptlist,
--- 3757,3763 ----
  	node->fdw_exprs = fdw_exprs;
  	node->fdw_private = fdw_private;
  	node->fdw_scan_tlist = fdw_scan_tlist;
+ 	node->fdw_scan_quals = fdw_scan_quals;
  	/* fs_relids will be filled in by create_foreignscan_plan */
  	node->fs_relids = NULL;
  	/* fsSystemCol will be filled in by create_foreignscan_plan */
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
***************
*** 1140,1145 **** set_foreignscan_references(PlannerInfo *root,
--- 1140,1148 ----
  			fix_scan_list(root, fscan->scan.plan.qual, rtoffset);
  		fscan->fdw_exprs =
  			fix_scan_list(root, fscan->fdw_exprs, rtoffset);
+ 		/* fdw_scan_quals too */
+ 		fscan->fdw_scan_quals =
+ 			fix_scan_list(root, fscan->fdw_scan_quals, rtoffset);
  	}
  
  	/* Adjust fs_relids if needed */
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
***************
*** 2396,2401 **** finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
--- 2396,2407 ----
  		case T_ForeignScan:
  			finalize_primnode((Node *) ((ForeignScan *) plan)->fdw_exprs,
  							  &context);
+ 
+ 			/*
+ 			 * We need not look at fdw_scan_quals, since it will have the same
+ 			 * param references as fdw_exprs.
+ 			 */
+ 
  			/* We assume fdw_scan_tlist cannot contain Params */
  			context.paramids = bms_add_members(context.paramids, scan_params);
  			break;
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1579,1584 **** typedef struct WorkTableScanState
--- 1579,1585 ----
  typedef struct ForeignScanState
  {
  	ScanState	ss;				/* its first field is NodeTag */
+ 	List	   *fdw_scan_quals;	/* remote quals if foreign table */
  	/* use struct pointer to avoid including fdwapi.h here */
  	struct FdwRoutine *fdwroutine;
  	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 524,529 **** typedef struct ForeignScan
--- 524,530 ----
  	List	   *fdw_exprs;		/* expressions that FDW may evaluate */
  	List	   *fdw_private;	/* private data for FDW */
  	List	   *fdw_scan_tlist; /* optional tlist describing scan tuple */
+ 	List	   *fdw_scan_quals;	/* remote quals if foreign table */
  	Bitmapset  *fs_relids;		/* RTIs generated by this scan */
  	bool		fsSystemCol;	/* true if any "system column" is needed */
  } ForeignScan;
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
***************
*** 45,51 **** extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
  				  Index scanrelid, Plan *subplan);
  extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
  				 Index scanrelid, List *fdw_exprs, List *fdw_private,
! 				 List *fdw_scan_tlist);
  extern Append *make_append(List *appendplans, List *tlist);
  extern RecursiveUnion *make_recursive_union(List *tlist,
  					 Plan *lefttree, Plan *righttree, int wtParam,
--- 45,51 ----
  				  Index scanrelid, Plan *subplan);
  extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
  				 Index scanrelid, List *fdw_exprs, List *fdw_private,
! 				 List *fdw_scan_tlist, List *fdw_scan_quals);
  extern Append *make_append(List *appendplans, List *tlist);
  extern RecursiveUnion *make_recursive_union(List *tlist,
  					 Plan *lefttree, Plan *righttree, int wtParam,

#122

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Kouhei Kaigai (#118)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/14 12:07, Kouhei Kaigai wrote:

On 2015/10/07 15:39, Etsuro Fujita wrote:
I noticed that the approach using a column to populate the foreign
scan's slot directly wouldn't work well in some cases. For example,
consider:

SELECT * FROM verysmall v LEFT JOIN (bigft1 JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r FOR UPDATE OF v;

The best plan is presumably something like this as you said before:

LockRows
-> Nested Loop
-> Seq Scan on verysmall v
-> Foreign Scan on bigft1 and bigft2
Remote SQL: SELECT * FROM bigft1 JOIN bigft2 ON bigft1.x =
bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Consider the EvalPlanQual testing to see if the updated version of a
tuple in v satisfies the query. If we use the column in the testing, we
would get the wrong results in some cases.

In this case, does ForeignScan have to be reset prior to ExecProcNode()?
Once ExecReScanForeignScan() gets called by ExecNestLoop(), it marks the EPQ
slot as invalid. So, more or less, ForeignScan needs to kick the remote
join again based on the new parameter coming from the latest verysmall tuple.
Please correct me if I don't understand correctly.
In the unparameterized ForeignScan case, the cached join-tuple works
well because it is independent of verysmall.

Once again, if the FDW driver is responsible for constructing the join-tuple
from the base relations' tuples cached in the EPQ slot, this case doesn't need
to kick the remote query again, because all the materials to construct the
join-tuple are already held locally. Right?

Sorry, maybe I misunderstand your words, but we are talking here about
an approach using a whole-row var that would populate a join tuple that
is returned by an FDW and stored in the scan slot in the corresponding
ForeignScanState node in the parent state tree.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#123

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 10 years ago

In reply to: Etsuro Fujita (#121)

Re: Foreign join pushdown vs EvalPlanQual

Ah..

I understood that what you mentioned is the lack of local recheck
of foreign tuples. Sorry for the noise.

At Wed, 14 Oct 2015 17:31:16 +0900, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote in <561E12D4.7040403@lab.ntt.co.jp>
On 2015/10/10 10:17, Robert Haas wrote:

On Thu, Oct 8, 2015 at 11:00 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

The best plan is presumably something like this as you said before:

LockRows
-> Nested Loop
-> Seq Scan on verysmall v
-> Foreign Scan on bigft1 and bigft2
Remote SQL: SELECT * FROM bigft1 JOIN bigft2 ON bigft1.x =
bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Consider the EvalPlanQual testing to see if the updated version of a
tuple in v satisfies the query. If we use the column in the testing,
we would get the wrong results in some cases.

More precisely, we would get the wrong result when the value of v.q or
v.r in the updated version has changed.

Interesting test case. It's worth considering why this works if you
were to replace the Foreign Scan with an Index Scan; suppose the query
is SELECT * FROM verysmall v LEFT JOIN realbiglocaltable t ON v.x =
t.x FOR UPDATE OF v, so that you get:

LockRows
-> Nested Loop
-> Seq Scan on verysmall v
-> Index Scan on realbiglocaltable t
Index Cond: v.x = t.x

In your example, the remote SQL pushes down certain quals to the
remote server, and so if we just return the same tuple they might no
longer be satisfied. In this example, the index qual is essentially a
filter condition that has been "pushed down" into the index AM. The
EvalPlanQual machinery prevents this from generating wrong answers by
rechecking the index cond - see IndexRecheck. Even though it's
normally the AM's job to enforce the index cond, and the executor does
not need to recheck, in the EvalPlanQual case it does need to recheck.

I think the foreign data wrapper case should be handled the same way.
Any condition that we initially pushed down to the foreign server
needs to be locally rechecked if we're inside EPQ.

Agreed.

As KaiGai-san also pointed out before, I think we should address this
in each of the following cases:

1) remote qual (scanrelid>0)
2) remote join (scanrelid==0)

As for #1, I noticed that there is a bug in handling the same kind of
FDW queries, which will be shown below. As you said, I think this
should be addressed by rechecking the remote quals *locally*. (I had
considered another fix for this kind of bug before, though.) IIUC, I
think this should be fixed separately from #2, as this is a bug not
only in 9.5, but in back branches. Please find attached a patch.

Create an environment:

mydatabase=# create table t1 (a int primary key, b text);
mydatabase=# insert into t1 select a, 'notsolongtext' from
generate_series(1, 1000000) a;

postgres=# create server myserver foreign data wrapper postgres_fdw
options (dbname 'mydatabase');
postgres=# create user mapping for current_user server myserver;
postgres=# create foreign table ft1 (a int, b text) server myserver
options (table_name 't1');
postgres=# alter foreign table ft1 options (add use_remote_estimate
'true');
postgres=# create table inttab (a int);
postgres=# insert into inttab select a from generate_series(1, 10) a;
postgres=# analyze ft1;
postgres=# analyze inttab;

Run concurrent transactions that produce an incorrect result:

[Terminal1]
postgres=# begin;
BEGIN
postgres=# update inttab set a = a + 1 where a = 1;
UPDATE 1

[Terminal2]
postgres=# explain verbose select * from inttab, ft1 where inttab.a =
ft1.a limit 1 for update;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Limit  (cost=100.43..198.99 rows=1 width=70)
   Output: inttab.a, ft1.a, ft1.b, inttab.ctid, ft1.*
   ->  LockRows  (cost=100.43..1086.00 rows=10 width=70)
         Output: inttab.a, ft1.a, ft1.b, inttab.ctid, ft1.*
         ->  Nested Loop  (cost=100.43..1085.90 rows=10 width=70)
               Output: inttab.a, ft1.a, ft1.b, inttab.ctid, ft1.*
               ->  Seq Scan on public.inttab  (cost=0.00..1.10 rows=10 width=10)
                     Output: inttab.a, inttab.ctid
               ->  Foreign Scan on public.ft1  (cost=100.43..108.47 rows=1 width=18)
                     Output: ft1.a, ft1.b, ft1.*
                     Remote SQL: SELECT a, b FROM public.t1 WHERE (($1::integer = a)) FOR UPDATE
(11 rows)

postgres=# select * from inttab, ft1 where inttab.a = ft1.a limit 1
for update;

[Terminal1]
postgres=# commit;
COMMIT

[Terminal2]
(After the commit in Terminal1, the following result will be shown in
Terminal2. Note that the values of inttab.a and ft1.a wouldn't
satisfy the remote qual!)
a | a | b
---+---+---------------
2 | 1 | notsolongtext
(1 row)
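
The stale row above can be reproduced in miniature. The following is only a
toy model in C, not PostgreSQL code: ToyTuple, ToyQual, qual_param_equals_a
and toy_foreign_recheck are names made up for illustration. It assumes the
pushed-down qual has the shape "$1 = a" and that a recheck simply evaluates
every saved remote qual against the already-fetched tuple, using the current
parameter value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a fetched foreign row; not a PostgreSQL type. */
typedef struct ToyTuple
{
	int			a;
} ToyTuple;

/* A pushed-down qual that depends on one parameter from the outer row. */
typedef bool (*ToyQual) (const ToyTuple *tuple, int param);

/* Models the remote qual "$1 = a" from the Remote SQL shown above. */
bool
qual_param_equals_a(const ToyTuple *tuple, int param)
{
	return param == tuple->a;
}

/*
 * Models a local recheck: the previously fetched tuple is kept only if
 * every remote qual still holds for the current parameter value.
 * An empty qual list accepts every tuple.
 */
bool
toy_foreign_recheck(const ToyQual *quals, size_t nquals,
					const ToyTuple *tuple, int param)
{
	for (size_t i = 0; i < nquals; i++)
	{
		if (!quals[i] (tuple, param))
			return false;
	}
	return true;
}
```

With the ft1 row {a = 1} and the qual above, the recheck passes for $1 = 1
(the value of inttab.a before the concurrent update) and fails for $1 = 2
(the value after it). Passing an empty qual list corresponds to a recheck
that always returns true, which is how the row "2 | 1" gets through.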

As for #2, I didn't come up with any solution to locally rechecking
pushed-down join conditions against a joined tuple populated from a
column that we discussed. Instead, I'd like to revise a
local-join-execution-plan-based approach that we discussed before, by
addressing your comments such as [1]. Would it be the right way to
go?

Best regards,
Etsuro Fujita

[1]
/messages/by-id/CA+TgmoaAzs0dR23R7PTBseQfwOtuVCPNBqDHxeBo9Gi+dMxj8w@mail.gmail.com

--
Kyotaro Horiguchi
NTT Open Source Software Center


#124

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Kyotaro HORIGUCHI (#119)

Re: Foreign join pushdown vs EvalPlanQual

On Wed, Oct 14, 2015 at 3:10 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

AFAICS, no updated version of the remote tables is obtained.

You're right, but that's OK: the previously-obtained tuples fail to
meet the current version of the quals, so there's no problem (that I
can see).

Even though the behavior I described above is correct, the join
would fail, too. But that is because v.r is no longer equal to
bigft2.r in the whole-row-var tuples. This seems to be the
same behavior as on local tables.

Yeah.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#125

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Etsuro Fujita (#121)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

On Wed, Oct 14, 2015 at 4:31 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

Agreed.

As KaiGai-san also pointed out before, I think we should address this in
each of the following cases:

1) remote qual (scanrelid>0)
2) remote join (scanrelid==0)

As for #1, I noticed that there is a bug in handling the same kind of FDW
queries, which will be shown below. As you said, I think this should be
addressed by rechecking the remote quals *locally*. (I had considered another
fix for this kind of bug before, though.) IIUC, I think this should be fixed
separately from #2, as this is a bug not only in 9.5, but in back branches.
Please find attached a patch.

+1 for doing something like this. However, I don't think we can
commit this to released branches, despite the fact that it's a bug
fix, because breaking third-party FDWs in a minor release seems
unfriendly. We might be able to slip it into 9.5, though, if we act
quickly.

A few review comments:

- nodeForeignscan.c now needs to #include "utils/memutils.h"
- I think it'd be safer for finalize_plan() not to try to shortcut
processing fdw_scan_quals.
- You forgot to update _readForeignScan.
- The documentation needs updating.
- I think we should use the name fdw_recheck_quals.

Here's an updated patch with those changes and some improvements to
the comments. Absent objections, I will commit it and back-patch to
9.5 only.
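
The rule this places on FDW authors (every clause taken out of the plan
node's qual list has to be handed back as a recheck qual) can be illustrated
with a small stand-alone C model. This is a sketch under assumptions, not the
postgres_fdw code: ToyClause, ToyPlan, TOY_MAX_CLAUSES and
toy_classify_clauses are hypothetical names, and "shippable" stands in for
whatever test the FDW uses to decide that the remote side can evaluate a
clause.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CLAUSES 16

/* Hypothetical stand-in for a scan clause; not a PostgreSQL type. */
typedef struct ToyClause
{
	const char *text;			/* printable form, for illustration only */
	bool		shippable;		/* can the remote server evaluate it? */
} ToyClause;

/* Hypothetical stand-in for the two qual lists kept on the plan node. */
typedef struct ToyPlan
{
	const ToyClause *local_quals[TOY_MAX_CLAUSES];	/* checked locally */
	size_t		nlocal;
	const ToyClause *recheck_quals[TOY_MAX_CLAUSES];	/* checked remotely,
														 * kept for rechecks */
	size_t		nrecheck;
} ToyPlan;

/*
 * Put each clause in exactly one list, so that no clause removed from the
 * local qual list is lost.  Returns false if there are too many clauses
 * for the fixed-size arrays of this toy model.
 */
bool
toy_classify_clauses(const ToyClause *clauses, size_t n, ToyPlan *plan)
{
	plan->nlocal = 0;
	plan->nrecheck = 0;

	if (n > TOY_MAX_CLAUSES)
		return false;

	for (size_t i = 0; i < n; i++)
	{
		if (clauses[i].shippable)
			plan->recheck_quals[plan->nrecheck++] = &clauses[i];
		else
			plan->local_quals[plan->nlocal++] = &clauses[i];
	}
	return true;
}
```

The property worth checking in a real FDW is the same one this model keeps:
the local list and the recheck list together cover every clause that the
core system passed in.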

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

foreign-recheck-for-foreign-table-v2.patch (application/x-patch)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 499f24f..5ce8f90 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -563,7 +563,8 @@ fileGetForeignPlan(PlannerInfo *root,
 							scan_relid,
 							NIL,	/* no expressions to evaluate */
 							best_path->fdw_private,
-							NIL /* no custom tlist */ );
+							NIL,	/* no custom tlist */
+							NIL /* no remote quals */ );
 }
 
 /*
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e4d799c..1902f1f 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -748,6 +748,7 @@ postgresGetForeignPlan(PlannerInfo *root,
 	Index		scan_relid = baserel->relid;
 	List	   *fdw_private;
 	List	   *remote_conds = NIL;
+	List	   *remote_exprs = NIL;
 	List	   *local_exprs = NIL;
 	List	   *params_list = NIL;
 	List	   *retrieved_attrs;
@@ -769,8 +770,8 @@ postgresGetForeignPlan(PlannerInfo *root,
 	 *
 	 * This code must match "extract_actual_clauses(scan_clauses, false)"
 	 * except for the additional decision about remote versus local execution.
-	 * Note however that we only strip the RestrictInfo nodes from the
-	 * local_exprs list, since appendWhereClause expects a list of
+	 * Note however that we don't strip the RestrictInfo nodes from the
+	 * remote_conds list, since appendWhereClause expects a list of
 	 * RestrictInfos.
 	 */
 	foreach(lc, scan_clauses)
@@ -784,11 +785,17 @@ postgresGetForeignPlan(PlannerInfo *root,
 			continue;
 
 		if (list_member_ptr(fpinfo->remote_conds, rinfo))
+		{
 			remote_conds = lappend(remote_conds, rinfo);
+			remote_exprs = lappend(remote_exprs, rinfo->clause);
+		}
 		else if (list_member_ptr(fpinfo->local_conds, rinfo))
 			local_exprs = lappend(local_exprs, rinfo->clause);
 		else if (is_foreign_expr(root, baserel, rinfo->clause))
+		{
 			remote_conds = lappend(remote_conds, rinfo);
+			remote_exprs = lappend(remote_exprs, rinfo->clause);
+		}
 		else
 			local_exprs = lappend(local_exprs, rinfo->clause);
 	}
@@ -874,7 +881,8 @@ postgresGetForeignPlan(PlannerInfo *root,
 							scan_relid,
 							params_list,
 							fdw_private,
-							NIL /* no custom tlist */ );
+							NIL,	/* no custom tlist */
+							remote_exprs);
 }
 
 /*
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 4c410c7..1533a6b 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1136,6 +1136,15 @@ GetForeignServerByName(const char *name, bool missing_ok);
     </para>
 
     <para>
+     Any clauses removed from the plan node's qual list must instead be added
+     to <literal>fdw_recheck_quals</> in order to ensure correct behavior
+     at the <literal>READ COMMITTED</> isolation level.  When a concurrent
+     update occurs for some other table involved in the query, the executor
+     may need to verify that all of the original quals are still satisfied for
+     the tuple, possibly against a different set of parameter values.
+    </para>
+
+    <para>
      Another <structname>ForeignScan</> field that can be filled by FDWs
      is <structfield>fdw_scan_tlist</>, which describes the tuples returned by
      the FDW for this plan node.  For simple foreign table scans this can be
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index bb28a73..6165e4a 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -25,6 +25,7 @@
 #include "executor/executor.h"
 #include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
+#include "utils/memutils.h"
 #include "utils/rel.h"
 
 static TupleTableSlot *ForeignNext(ForeignScanState *node);
@@ -72,8 +73,19 @@ ForeignNext(ForeignScanState *node)
 static bool
 ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
 {
-	/* There are no access-method-specific conditions to recheck. */
-	return true;
+	ExprContext *econtext;
+
+	/*
+	 * extract necessary information from foreign scan node
+	 */
+	econtext = node->ss.ps.ps_ExprContext;
+
+	/* Does the tuple meet the remote qual condition? */
+	econtext->ecxt_scantuple = slot;
+
+	ResetExprContext(econtext);
+
+	return ExecQual(node->fdw_recheck_quals, econtext, false);
 }
 
 /* ----------------------------------------------------------------
@@ -135,6 +147,9 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.qual = (List *)
 		ExecInitExpr((Expr *) node->scan.plan.qual,
 					 (PlanState *) scanstate);
+	scanstate->fdw_recheck_quals = (List *)
+		ExecInitExpr((Expr *) node->fdw_recheck_quals,
+					 (PlanState *) scanstate);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 0b4ab23..c176ff9 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -648,6 +648,7 @@ _copyForeignScan(const ForeignScan *from)
 	COPY_NODE_FIELD(fdw_exprs);
 	COPY_NODE_FIELD(fdw_private);
 	COPY_NODE_FIELD(fdw_scan_tlist);
+	COPY_NODE_FIELD(fdw_recheck_quals);
 	COPY_BITMAPSET_FIELD(fs_relids);
 	COPY_SCALAR_FIELD(fsSystemCol);
 
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index df7f6e1..3e75cd1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -594,6 +594,7 @@ _outForeignScan(StringInfo str, const ForeignScan *node)
 	WRITE_NODE_FIELD(fdw_exprs);
 	WRITE_NODE_FIELD(fdw_private);
 	WRITE_NODE_FIELD(fdw_scan_tlist);
+	WRITE_NODE_FIELD(fdw_recheck_quals);
 	WRITE_BITMAPSET_FIELD(fs_relids);
 	WRITE_BOOL_FIELD(fsSystemCol);
 }
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 5802a73..94ba6dc 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1798,6 +1798,7 @@ _readForeignScan(void)
 	READ_NODE_FIELD(fdw_exprs);
 	READ_NODE_FIELD(fdw_private);
 	READ_NODE_FIELD(fdw_scan_tlist);
+	READ_NODE_FIELD(fdw_recheck_quals);
 	READ_BITMAPSET_FIELD(fs_relids);
 	READ_BOOL_FIELD(fsSystemCol);
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0ee7392..791b64e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2153,6 +2153,9 @@ create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
 			replace_nestloop_params(root, (Node *) scan_plan->scan.plan.qual);
 		scan_plan->fdw_exprs = (List *)
 			replace_nestloop_params(root, (Node *) scan_plan->fdw_exprs);
+		scan_plan->fdw_recheck_quals = (List *)
+			replace_nestloop_params(root,
+									(Node *) scan_plan->fdw_recheck_quals);
 	}
 
 	/*
@@ -3738,7 +3741,8 @@ make_foreignscan(List *qptlist,
 				 Index scanrelid,
 				 List *fdw_exprs,
 				 List *fdw_private,
-				 List *fdw_scan_tlist)
+				 List *fdw_scan_tlist,
+				 List *fdw_recheck_quals)
 {
 	ForeignScan *node = makeNode(ForeignScan);
 	Plan	   *plan = &node->scan.plan;
@@ -3754,6 +3758,7 @@ make_foreignscan(List *qptlist,
 	node->fdw_exprs = fdw_exprs;
 	node->fdw_private = fdw_private;
 	node->fdw_scan_tlist = fdw_scan_tlist;
+	node->fdw_recheck_quals = fdw_recheck_quals;
 	/* fs_relids will be filled in by create_foreignscan_plan */
 	node->fs_relids = NULL;
 	/* fsSystemCol will be filled in by create_foreignscan_plan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 9392d61..8c6c571 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1133,13 +1133,15 @@ set_foreignscan_references(PlannerInfo *root,
 	}
 	else
 	{
-		/* Adjust tlist, qual, fdw_exprs in the standard way */
+		/* Adjust tlist, qual, fdw_exprs, etc. in the standard way */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist, rtoffset);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual, rtoffset);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs, rtoffset);
+		fscan->fdw_recheck_quals =
+			fix_scan_list(root, fscan->fdw_recheck_quals, rtoffset);
 	}
 
 	/* Adjust fs_relids if needed */
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 6b32f85..60b4ae1 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2394,11 +2394,19 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_ForeignScan:
-			finalize_primnode((Node *) ((ForeignScan *) plan)->fdw_exprs,
-							  &context);
-			/* We assume fdw_scan_tlist cannot contain Params */
-			context.paramids = bms_add_members(context.paramids, scan_params);
-			break;
+			{
+				ForeignScan *fscan = (ForeignScan *) plan;
+
+				finalize_primnode((Node *) fscan->fdw_exprs,
+								  &context);
+				finalize_primnode((Node *) fscan->fdw_recheck_quals,
+								  &context);
+
+				/* We assume fdw_scan_tlist cannot contain Params */
+				context.paramids = bms_add_members(context.paramids,
+												   scan_params);
+				break;
+			}
 
 		case T_CustomScan:
 			{
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b6895f9..23670e1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1579,6 +1579,7 @@ typedef struct WorkTableScanState
 typedef struct ForeignScanState
 {
 	ScanState	ss;				/* its first field is NodeTag */
+	List	   *fdw_recheck_quals;	/* original quals not in ss.ps.qual */
 	/* use struct pointer to avoid including fdwapi.h here */
 	struct FdwRoutine *fdwroutine;
 	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 1f9213c..92fd8e4 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -512,6 +512,11 @@ typedef struct WorkTableScan
  * fdw_scan_tlist is never actually executed; it just holds expression trees
  * describing what is in the scan tuple's columns.
  *
+ * fdw_recheck_quals should contain any quals which the core system passed to
+ * the FDW but which were not added to scan.plan.quals; that is, it should
+ * contain the quals being checked remotely.  This is needed for correct
+ * behavior during EvalPlanQual rechecks.
+ *
  * When the plan node represents a foreign join, scan.scanrelid is zero and
  * fs_relids must be consulted to identify the join relation.  (fs_relids
  * is valid for simple scans as well, but will always match scan.scanrelid.)
@@ -524,6 +529,7 @@ typedef struct ForeignScan
 	List	   *fdw_exprs;		/* expressions that FDW may evaluate */
 	List	   *fdw_private;	/* private data for FDW */
 	List	   *fdw_scan_tlist; /* optional tlist describing scan tuple */
+	List	   *fdw_recheck_quals;	/* original quals not in scan.plan.quals */
 	Bitmapset  *fs_relids;		/* RTIs generated by this scan */
 	bool		fsSystemCol;	/* true if any "system column" is needed */
 } ForeignScan;
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 52b077a..1fb8504 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -45,7 +45,7 @@ extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
 				  Index scanrelid, Plan *subplan);
 extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
 				 Index scanrelid, List *fdw_exprs, List *fdw_private,
-				 List *fdw_scan_tlist);
+				 List *fdw_scan_tlist, List *fdw_recheck_quals);
 extern Append *make_append(List *appendplans, List *tlist);
 extern RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree, Plan *righttree, int wtParam,

#126

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Kyotaro HORIGUCHI (#120)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Sent: Wednesday, October 14, 2015 4:40 PM
To: Kaigai Kouhei(海外浩平)
Cc: fujita.etsuro@lab.ntt.co.jp; pgsql-hackers@postgresql.org;
shigeru.hanada@gmail.com; robertmhaas@gmail.com
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

Hello,

At Wed, 14 Oct 2015 03:07:31 +0000, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote
in <9A28C8860F777E439AA12E8AEA7694F801157077@BPXM15GP.gisp.nec.co.jp>

I noticed that the approach using a column to populate the foreign
scan's slot directly wouldn't work well in some cases. For example,
consider:

SELECT * FROM verysmall v LEFT JOIN (bigft1 JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r FOR UPDATE OF v;

The best plan is presumably something like this as you said before:

LockRows
-> Nested Loop
-> Seq Scan on verysmall v
-> Foreign Scan on bigft1 and bigft2
Remote SQL: SELECT * FROM bigft1 JOIN bigft2 ON bigft1.x =
bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Consider the EvalPlanQual testing to see if the updated version of a
tuple in v satisfies the query. If we use the column in the testing, we
would get the wrong results in some cases.

I have a basic (or maybe silly) question. Is it true that the
join-inner (the foreignscan in the example) is re-executed with
the modified value of v.r? I observed for a join case among only
local tables that previously fetched tuples for the inner are
simply reused regardless of join types. Even when a refetch
happens (I haven't confirmed but it would occur in the case of no
security quals), the tuple is pointed to by ctid so the re-join
between local and remote would fail. Is this wrong?

Let's dive into ExecNestLoop().
Once nl_NeedNewOuter is true, ExecProcNode(outerPlan) is called, then
ExecReScan(innerPlan) is called with the new param-info delivered from the
outer tuple.

nl_NeedNewOuter is reset just after ExecProcNode(outerPlan); it is set
again once a new outer tuple is needed, that is, when the inner scan has
reached the end of the relation or a semi-join has found a match.
If a semi-join returned a joined tuple and an EPQ recheck is then
applied, it can call ExecProcNode(outerPlan) and reset the inner plan
state.
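
The control flow described above can be pictured with a small standalone model. This is only a sketch, not the PostgreSQL source: ToyNestLoop, toy_init and toy_next are hypothetical names, the integer arrays stand in for the two plan nodes, and only the need_new_outer flag mirrors nl_NeedNewOuter. It assumes an inner scan parameterized by equality on the outer value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for NestLoopState; all names here are hypothetical. */
typedef struct ToyNestLoop
{
    const int  *outer;          /* outer "relation" */
    size_t      n_outer;
    const int  *inner;          /* inner "relation" */
    size_t      n_inner;
    size_t      outer_pos;      /* next outer tuple to fetch */
    size_t      inner_pos;      /* next inner tuple to fetch */
    int         param;          /* value passed down from the outer tuple */
    bool        need_new_outer; /* plays the role of nl_NeedNewOuter */
    int         rescans;        /* number of inner rescans, for inspection */
} ToyNestLoop;

void
toy_init(ToyNestLoop *nl, const int *outer, size_t n_outer,
         const int *inner, size_t n_inner)
{
    nl->outer = outer;
    nl->n_outer = n_outer;
    nl->inner = inner;
    nl->n_inner = n_inner;
    nl->outer_pos = 0;
    nl->inner_pos = 0;
    nl->param = 0;
    nl->need_new_outer = true;
    nl->rescans = 0;
}

/* Returns true and sets *result when a joined "tuple" is found. */
bool
toy_next(ToyNestLoop *nl, int *result)
{
    assert(result != NULL);

    for (;;)
    {
        if (nl->need_new_outer)
        {
            if (nl->outer_pos >= nl->n_outer)
                return false;   /* outer side exhausted: join is done */
            nl->param = nl->outer[nl->outer_pos++];
            nl->need_new_outer = false;
            /* Rescan the inner side with the new parameter value. */
            nl->inner_pos = 0;
            nl->rescans++;
        }

        /* Fetch the next inner tuple that matches the parameter. */
        while (nl->inner_pos < nl->n_inner)
        {
            int         v = nl->inner[nl->inner_pos++];

            if (v == nl->param)
            {
                *result = v;
                return true;
            }
        }

        /* Inner side exhausted: ask for the next outer tuple. */
        nl->need_new_outer = true;
    }
}
```

In this model the inner side is rescanned only when a new outer tuple is fetched; nothing in the loop re-runs the inner scan on behalf of an EPQ recheck, which is the point in question.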

That is what I can say from the existing code.
I doubt whether this behavior is right for EPQ rechecks. In the above scenario
the outer relation (verysmall) is updated by the concurrent
session, thus the param-info has to be updated.

However, it does not look to me like the implementation pays attention to this.
If ExecNestLoop() is called in the EPQ recheck context, it needs to
call ExecProcNode() on both the outer and the inner plan to ensure the
visibility of the joined tuple against the latest state.
Of course, the underlying scan plans for base relations never advance
the scan pointer. They just return the tuple in the EPQ slot, and I want
ExecNestLoop() to evaluate whether these tuples satisfy the join clause.

In this case, does ForeignScan have to be reset prior to ExecProcNode()?
Once ExecReScanForeignScan() gets called by ExecNestLoop(), it marks the EPQ
slot as invalid. So, more or less, ForeignScan needs to kick the remote
join again based on the new parameter coming from the latest verysmall tuple.
Please correct me if I don't understand correctly.

So, no rescan would happen in these cases, I think. ReScan seems
to be kicked only for a new (next) outer tuple that causes
a change of parameter, but not for EPQ. I might have
misunderstood you.

In the case of an unparameterized ForeignScan, the cached join tuple works
well because it is independent of verysmall.

Once again, if the FDW driver is responsible for constructing the join tuple from
the base relations' tuples cached in the EPQ slots, this case doesn't need to
kick the remote query again, because all the materials to construct the join
tuple are already held locally. Right?

It is definitely right and should be doable. But I think the
point we are arguing here is what the desirable behavior is.

In the scanrelid==0 case, ForeignScan/CustomScan is expected to
behave as if a local join existed here. This requires ForeignScan to generate
a joined tuple as the result of the remote join, which may contain multiple junk
TLEs to carry whole-row Var references to the base foreign tables.
By that criterion, the desirable behavior is clear, as below:

1. FDW/CSP picks up the base relations' tuples from the EPQ slots.
These are set up from the whole-row reference under early row-lock semantics,
or by RefetchForeignRow under late row-lock semantics.

2. Fill up ss_ScanTupleSlot according to the xxx_scan_tlist.
We may be able to provide a common support function here, because this
list keeps the relation between a particular attribute of the joined tuple
and its source column.

3. Apply the join clauses and base restrictions that were pushed down.
setrefs.c initializes the expressions kept in fdw_exprs/custom_exprs to run
on the ss_ScanTupleSlot. This is the easiest way to do the check here.

4. If the joined tuple is still visible after step 3, FDW/CSP returns
the joined tuple. Otherwise, it returns an empty slot.

This behavior is entirely compatible with having a local join located at
the position of the ForeignScan/CustomScan with scanrelid==0.

Even if the remote join is parameterized by another relation, we can simply
use the param-info delivered from the corresponding outer scan at step 3.
EState should have the parameters already updated, so the FDW driver needs to
care about nothing.
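
The four steps above can be sketched as a standalone model. Everything here is hypothetical (ToyFt1, ToyFt2, ToyJoined, ToyEpqState, toy_recheck_join): it is not the FDW API, only an illustration of rebuilding the joined tuple from the per-relation EPQ tuples and re-applying the pushed-down clauses locally. The clauses follow the bigft1/bigft2 example quoted earlier, and only an inner join is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy base-relation tuples; hypothetical, not PostgreSQL types. */
typedef struct ToyFt1 { int x; int q; } ToyFt1;
typedef struct ToyFt2 { int x; int r; } ToyFt2;

/* Toy joined tuple, standing in for the contents of ss_ScanTupleSlot. */
typedef struct ToyJoined { int x; int q; int r; } ToyJoined;

/* Toy EPQ state: one saved tuple per base relation, NULL if not set. */
typedef struct ToyEpqState
{
    const ToyFt1 *ft1;
    const ToyFt2 *ft2;
} ToyEpqState;

/*
 * Returns true and fills *out if the joined tuple still qualifies,
 * false (the "empty slot" case) otherwise.
 */
bool
toy_recheck_join(const ToyEpqState *epq, int param_q, int param_r,
                 ToyJoined *out)
{
    /* Step 1: pick up the base relations' tuples from the EPQ state. */
    const ToyFt1 *t1 = epq->ft1;
    const ToyFt2 *t2 = epq->ft2;

    if (t1 == NULL || t2 == NULL)
        return false;

    /* Step 2: fill the joined tuple from its source columns. */
    ToyJoined   j = { t1->x, t1->q, t2->r };

    /* Step 3: re-apply the pushed-down join clause and parameterized quals. */
    if (t1->x != t2->x || j.q != param_q || j.r != param_r)
        return false;

    /* Step 4: the joined tuple is still visible, so return it. */
    *out = j;
    return true;
}
```

Note that param_q and param_r are simply passed in here; in the proposal they would come from the already-updated parameters in EState, so no remote query is needed for the recheck.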

This is a much less invasive approach with respect to the existing EPQ recheck
mechanism. I cannot understand why Fujita-san never "try" this approach.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#127

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 10 years ago

In reply to: Robert Haas (#125)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

I confirmed that an epqtuple of foreign parameterized scan is
correctly rejected by fdw_recheck_quals with modified outer
tuple.

I have no objection to this and have two humble comments.

In file_fdw.c, the comment for the last parameter just after the
added line seems to be better to be aligned with other comments.

In subselect.c, the added break is in the added curly-braces but
it would be better to place it after the closing brace, like the
other cases.

regards,

At Wed, 14 Oct 2015 15:21:41 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZ8REePoFv7ZjjDH-T54sQw40fnP4Mkr8hw5eizbxA4BA@mail.gmail.com>

On Wed, Oct 14, 2015 at 4:31 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

1) remote qual (scanrelid>0)
2) remote join (scanrelid==0)

As for #1, I noticed that there is a bug in handling the same kind of FDW
queries, which will be shown below. As you said, I think this should be
addressed by rechecking the remote quals *locally*. (I thought another fix
for this kind of bug before, though.) IIUC, I think this should be fixed
separately from #2, as this is a bug not only in 9.5, but in back branches.
Please find attached a patch.

+1 for doing something like this. However, I don't think we can
commit this to released branches, despite the fact that it's a bug
fix, because breaking third-party FDWs in a minor release seems
unfriendly. We might be able to slip it into 9.5, though, if we act
quickly.

A few review comments:

- nodeForeignscan.c now needs to #include "utils/memutils.h"
- I think it'd be safer for finalize_plan() not to try to shortcut
processing fdw_scan_quals.
- You forgot to update _readForeignScan.
- The documentation needs updating.
- I think we should use the name fdw_recheck_quals.

Here's an updated patch with those changes and some improvements to
the comments. Absent objections, I will commit it and back-patch to
9.5 only.

--
Kyotaro Horiguchi
NTT Open Source Software Center


#128

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Kouhei Kaigai (#126)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/15 11:36, Kouhei Kaigai wrote:

Once again, if the FDW driver is responsible for constructing the join tuple from
the base relations' tuples cached in the EPQ slots, this case doesn't need to
kick the remote query again, because all the materials to construct the join
tuple are already held locally. Right?

I now understand clearly what you mean. Sorry for my misunderstanding.

In the scanrelid==0 case, ForeignScan/CustomScan is expected to
behave as if a local join existed here. This requires ForeignScan to generate
a joined tuple as the result of the remote join, which may contain multiple junk
TLEs to carry whole-row Var references to the base foreign tables.
By that criterion, the desirable behavior is clear, as below:

1. FDW/CSP picks up the base relations' tuples from the EPQ slots.
These are set up from the whole-row reference under early row-lock semantics,
or by RefetchForeignRow under late row-lock semantics.

2. Fill up ss_ScanTupleSlot according to the xxx_scan_tlist.
We may be able to provide a common support function here, because this
list keeps the relation between a particular attribute of the joined tuple
and its source column.

3. Apply the join clauses and base restrictions that were pushed down.
setrefs.c initializes the expressions kept in fdw_exprs/custom_exprs to run
on the ss_ScanTupleSlot. This is the easiest way to do the check here.

4. If the joined tuple is still visible after step 3, FDW/CSP returns
the joined tuple. Otherwise, it returns an empty slot.

This behavior is entirely compatible with having a local join located at
the position of the ForeignScan/CustomScan with scanrelid==0.

Even if the remote join is parameterized by another relation, we can simply
use the param-info delivered from the corresponding outer scan at step 3.
EState should have the parameters already updated, so the FDW driver needs to
care about nothing.

This is a much less invasive approach with respect to the existing EPQ recheck
mechanism.

I see. That's an idea, but I guess that steps 2 and 3 would require adding
a lot of code to the core. Why don't you use the local join execution
plan that we discussed? I think that would make the whole series of
processing much simpler. I'm now revising the patch that I created for
that. If it's okay, I'd like to propose an updated version of the patch
in a few days.

I cannot understand why Fujita-san never "try" this approach.

Maybe my explanation was not correct, but I didn't say such a thing.
What I objected to, rather, was adding a new FDW callback routine for
rechecking pushed-down quals or pushed-down joins, which I think you
insisted on.

Best regards,
Etsuro Fujita


#129

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Etsuro Fujita (#128)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: Etsuro Fujita [mailto:fujita.etsuro@lab.ntt.co.jp]
Sent: Thursday, October 15, 2015 7:00 PM
To: Kaigai Kouhei(海外浩平); Kyotaro HORIGUCHI
Cc: pgsql-hackers@postgresql.org; shigeru.hanada@gmail.com;
robertmhaas@gmail.com
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/10/15 11:36, Kouhei Kaigai wrote:

Once again, if the FDW driver is responsible for constructing the join tuple from
the base relations' tuples cached in the EPQ slots, this case doesn't need to
kick the remote query again, because all the materials to construct the join
tuple are already held locally. Right?

I now understand clearly what you mean. Sorry for my misunderstanding.

In the scanrelid==0 case, ForeignScan/CustomScan is expected to
behave as if a local join existed here. This requires ForeignScan to generate
a joined tuple as the result of the remote join, which may contain multiple junk
TLEs to carry whole-row Var references to the base foreign tables.
By that criterion, the desirable behavior is clear, as below:

1. FDW/CSP picks up the base relations' tuples from the EPQ slots.
These are set up from the whole-row reference under early row-lock semantics,
or by RefetchForeignRow under late row-lock semantics.

2. Fill up ss_ScanTupleSlot according to the xxx_scan_tlist.
We may be able to provide a common support function here, because this
list keeps the relation between a particular attribute of the joined tuple
and its source column.

3. Apply the join clauses and base restrictions that were pushed down.
setrefs.c initializes the expressions kept in fdw_exprs/custom_exprs to run
on the ss_ScanTupleSlot. This is the easiest way to do the check here.

4. If the joined tuple is still visible after step 3, FDW/CSP returns
the joined tuple. Otherwise, it returns an empty slot.

This behavior is entirely compatible with having a local join located at
the position of the ForeignScan/CustomScan with scanrelid==0.

Even if the remote join is parameterized by another relation, we can simply
use the param-info delivered from the corresponding outer scan at step 3.
EState should have the parameters already updated, so the FDW driver needs to
care about nothing.

This is a much less invasive approach with respect to the existing EPQ recheck
mechanism.

I see. That's an idea, but I guess that steps 2 and 3 would require adding
a lot of code to the core. Why don't you use the local join execution
plan that we discussed? I think that would make the whole series of
processing much simpler. I'm now revising the patch that I created for
that. If it's okay, I'd like to propose an updated version of the patch
in a few days.

Let me explain why the above idea is simpler and suitable for the v9.5
timeline.
As I've consistently proposed for these two months, steps 2 and 3
are assumed to be handled in the callback routine kicked from
ForeignRecheck().

Even if the core backend eventually provides utility routines to support
the above tasks, that is not a mandatory requirement from the beginning, at
least not in the v9.5 timeline.
As long as the callback is provided, an FDW driver "can" implement the above
features by itself, in whatever way it finds comfortable.
Note that an alternative local join plan is one way to implement
steps 2 and 3; however, I would never force people to use a particular way.
People can choose.

Regarding the amount of code in the core backend, it is pretty small
because all we need to add in v9.5 is just a callback. We can implement
the remaining support routines in the v9.6 timeline, but not now.
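
To make the size argument concrete, here is a sketch of what "just a callback" could look like. The names ToyFdwRoutine, recheck_cb, toy_foreign_recheck and toy_driver_recheck are hypothetical, and the types are toy stand-ins rather than the structures in fdwapi.h; the only idea taken from the proposal is that ForeignRecheck() defers to an optional routine supplied by the FDW.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy slot: a single integer column plus an "empty" flag. */
typedef struct ToySlot { bool empty; int value; } ToySlot;

struct ToyScanState;

/* Hypothetical optional callback: returns false to reject the tuple. */
typedef bool (*ToyRecheck_function) (struct ToyScanState *node, ToySlot *slot);

typedef struct ToyFdwRoutine
{
    ToyRecheck_function recheck_cb;     /* NULL if the FDW provides none */
} ToyFdwRoutine;

typedef struct ToyScanState
{
    const ToyFdwRoutine *fdwroutine;
    int         param;                  /* current outer parameter value */
} ToyScanState;

/* Core side: the only addition is the call through the callback. */
bool
toy_foreign_recheck(ToyScanState *node, ToySlot *slot)
{
    if (node->fdwroutine->recheck_cb != NULL &&
        !node->fdwroutine->recheck_cb(node, slot))
        return false;
    return !slot->empty;
}

/* FDW side: one possible implementation, chosen by the driver. */
bool
toy_driver_recheck(ToyScanState *node, ToySlot *slot)
{
    return !slot->empty && slot->value == node->param;
}
```

How the driver fills in its routine (an alternative local join plan, its own code, or a future core helper) is invisible to the core side in this sketch.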

I cannot understand why Fujita-san never "try" this approach.

Maybe my explanation was not correct, but I didn't say such a thing.
What I objected to, rather, was adding a new FDW callback routine for
rechecking pushed-down quals or pushed-down joins, which I think you
insisted on.

My proposal has been consistent.
The interface contract (in other words, the job of the callback
implementation) is the four steps I introduced above.
We can use an alternative local join plan, or our own implementation to fill
up ss_ScanTupleSlot, or some common support routine provided by
core.
Regardless of the implementation choice, the callback approach minimizes
the impact on the existing EPQ recheck mechanism and on the release schedule of
v9.5. Also, it can handle the scanrelid==0 case.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#130

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Kyotaro HORIGUCHI (#127)

Re: Foreign join pushdown vs EvalPlanQual

On Thu, Oct 15, 2015 at 3:04 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I confirmed that an epqtuple of foreign parameterized scan is
correctly rejected by fdw_recheck_quals with modified outer
tuple.

I have no objection to this and have two humble comments.

In file_fdw.c, the comment for the last parameter just after the
added line seems to be better to be aligned with other comments.

I've pgindented the file. Any other space we might choose would just
be changed by the next pgindent run, so there's no point in trying to
vary.

In subselect.c, the added break is in the added curly-braces but
it would be better to place it after the closing brace, like the
other cases.

Changed that, and committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#131

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Robert Haas (#130)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/16 2:14, Robert Haas wrote:

On Thu, Oct 15, 2015 at 3:04 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I confirmed that an epqtuple of foreign parameterized scan is
correctly rejected by fdw_recheck_quals with modified outer
tuple.

I have no objection to this and have two humble comments.

In file_fdw.c, the comment for the last parameter just after the
added line seems to be better to be aligned with other comments.

I've pgindented the file. Any other space we might choose would just
be changed by the next pgindent run, so there's no point in trying to
vary.

In subselect.c, the added break is in the added curly-braces but
it would be better to place it after the closing brace, like the
other cases.

Changed that, and committed.

Thanks, Robert and Horiguchi-san.

Best regards,
Etsuro Fujita


#132

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Kouhei Kaigai (#129)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/15 11:36, Kouhei Kaigai wrote:

In the scanrelid==0 case, ForeignScan/CustomScan is expected to
behave as if a local join existed here. This requires ForeignScan to generate
a joined tuple as the result of the remote join, which may contain multiple junk
TLEs to carry whole-row Var references to the base foreign tables.
By that criterion, the desirable behavior is clear, as below:

1. FDW/CSP picks up the base relations' tuples from the EPQ slots.
These are set up from the whole-row reference under early row-lock semantics,
or by RefetchForeignRow under late row-lock semantics.

2. Fill up ss_ScanTupleSlot according to the xxx_scan_tlist.
We may be able to provide a common support function here, because this
list keeps the relation between a particular attribute of the joined tuple
and its source column.

3. Apply the join clauses and base restrictions that were pushed down.
setrefs.c initializes the expressions kept in fdw_exprs/custom_exprs to run
on the ss_ScanTupleSlot. This is the easiest way to do the check here.

4. If the joined tuple is still visible after step 3, FDW/CSP returns
the joined tuple. Otherwise, it returns an empty slot.

This behavior is entirely compatible with having a local join located at
the position of the ForeignScan/CustomScan with scanrelid==0.

Even if the remote join is parameterized by another relation, we can simply
use the param-info delivered from the corresponding outer scan at step 3.
EState should have the parameters already updated, so the FDW driver needs to
care about nothing.

This is a much less invasive approach with respect to the existing EPQ recheck
mechanism.

I wrote:

I see. That's an idea, but I guess that steps 2 and 3 would require adding
a lot of code to the core. Why don't you use the local join execution
plan that we discussed? I think that would make the whole series of
processing much simpler. I'm now revising the patch that I created for
that. If it's okay, I'd like to propose an updated version of the patch
in a few days.

On 2015/10/15 20:19, Kouhei Kaigai wrote:

Let me explain why the above idea is simpler and suitable for the v9.5
timeline.
As I've consistently proposed for these two months, steps 2 and 3
are assumed to be handled in the callback routine kicked from
ForeignRecheck().

Honestly, I still don't see much value in doing so.
As Robert mentioned in [1], I think that if we're inside EPQ,
pushed-down quals and/or pushed-down joins should be locally rechecked
in the same way as other cases such as IndexRecheck. So, I'll propose
the updated version of the patch.

Thanks for the explanation!

Best regards,
Etsuro Fujita

[1]: /messages/by-id/CA+Tgmoau7jVTLF0Oh9a_Mu9S=vrw7i6u_h7JSpzBXv0xtyo_Bg@mail.gmail.com


#133

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Etsuro Fujita (#121)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/14 17:31, Etsuro Fujita wrote:

As KaiGai-san also pointed out before, I think we should address this in
each of the following cases:

1) remote qual (scanrelid>0)
2) remote join (scanrelid==0)

As for #2, I updated the patch, which uses a local join execution plan
for an EvalPlanQual recheck, according to the comment from Robert [1].
Attached is an updated version of the patch. This is a WIP patch, but
I would appreciate early feedback.

For tests, apply the patches:

foreign-recheck-for-foreign-join-1.patch
usermapping_matching.patch [2]
add_GetUserMappingById.patch [2]
foreign_join_v16_efujita.patch [3]

As I said upthread, what I'd like to discuss is the changes to
the PG core, so I didn't do anything about the postgres_fdw patches.

Best regards,
Etsuro Fujita

[1]: /messages/by-id/CA+TgmoaAzs0dR23R7PTBseQfwOtuVCPNBqDHxeBo9Gi+dMxj8w@mail.gmail.com
[2]: /messages/by-id/CAEZqfEe9KGy=1_waGh2rgZPg0o4pqgD+iauYaj8wTze+CYJUHg@mail.gmail.com
[3]: /messages/by-id/55CB2D45.7040100@lab.ntt.co.jp

Attachments:

foreign-recheck-for-foreign-join-1.patch (text/x-patch)

*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***************
*** 525,530 **** fileGetForeignPaths(PlannerInfo *root,
--- 525,531 ----
  									 total_cost,
  									 NIL,		/* no pathkeys */
  									 NULL,		/* no outer rel either */
+ 									 NULL,		/* no alternative path */
  									 coptions));
  
  	/*
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***************
*** 560,565 **** postgresGetForeignPaths(PlannerInfo *root,
--- 560,566 ----
  								   fpinfo->total_cost,
  								   NIL, /* no pathkeys */
  								   NULL,		/* no outer rel either */
+ 								   NULL,		/* no alternative path */
  								   NIL);		/* no fdw_private list */
  	add_path(baserel, (Path *) path);
  
***************
*** 727,732 **** postgresGetForeignPaths(PlannerInfo *root,
--- 728,734 ----
  									   total_cost,
  									   NIL,		/* no pathkeys */
  									   param_info->ppi_req_outer,
+ 									   NULL,	/* no alternative path */
  									   NIL);	/* no fdw_private list */
  		add_path(baserel, (Path *) path);
  	}
*** a/src/backend/executor/execScan.c
--- b/src/backend/executor/execScan.c
***************
*** 48,59 **** ExecScanFetch(ScanState *node,
  		 * conditions.
  		 */
  		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;
  
  		Assert(scanrelid > 0);
  		if (estate->es_epqTupleSet[scanrelid - 1])
  		{
- 			TupleTableSlot *slot = node->ss_ScanTupleSlot;
- 
  			/* Return empty slot if we already returned a tuple */
  			if (estate->es_epqScanDone[scanrelid - 1])
  				return ExecClearTuple(slot);
--- 48,67 ----
  		 * conditions.
  		 */
  		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;
+ 		TupleTableSlot *slot = node->ss_ScanTupleSlot;
+ 
+ 		/*
+ 		 * Execute recheck plan and get the next tuple if foreign join.
+ 		 */
+ 		if (scanrelid == 0)
+ 		{
+ 			(*recheckMtd) (node, slot);
+ 			return slot;
+ 		}
  
  		Assert(scanrelid > 0);
  		if (estate->es_epqTupleSet[scanrelid - 1])
  		{
  			/* Return empty slot if we already returned a tuple */
  			if (estate->es_epqScanDone[scanrelid - 1])
  				return ExecClearTuple(slot);
***************
*** 347,352 **** ExecScanReScan(ScanState *node)
--- 355,363 ----
  	{
  		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;
  
+ 		if (scanrelid == 0)
+ 			return;				/* nothing to do */
+ 
  		Assert(scanrelid > 0);
  
  		estate->es_epqScanDone[scanrelid - 1] = false;
*** a/src/backend/executor/nodeForeignscan.c
--- b/src/backend/executor/nodeForeignscan.c
***************
*** 24,29 ****
--- 24,30 ----
  
  #include "executor/executor.h"
  #include "executor/nodeForeignscan.h"
+ #include "executor/tuptable.h"
  #include "foreign/fdwapi.h"
  #include "utils/memutils.h"
  #include "utils/rel.h"
***************
*** 80,85 **** ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
--- 81,103 ----
  	 */
  	econtext = node->ss.ps.ps_ExprContext;
  
+ 	if (node->fdw_recheck_plan != NULL)
+ 	{
+ 		TupleTableSlot *result;
+ 		MemoryContext oldcontext;
+ 
+ 		/* Must be in query context to call recheck plan */
+ 		oldcontext = MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
+ 		/* Execute recheck plan */
+ 		result = ExecProcNode(node->fdw_recheck_plan);
+ 		MemoryContextSwitchTo(oldcontext);
+ 		if (TupIsNull(result))
+ 			return false;
+ 		/* Store result in the given slot */
+ 		ExecCopySlot(slot, result);
+ 		return true;
+ 	}
+ 
  	/* Does the tuple meet the remote qual condition? */
  	econtext->ecxt_scantuple = slot;
  
***************
*** 200,205 **** ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
--- 218,229 ----
  	ExecAssignScanProjectionInfoWithVarno(&scanstate->ss, tlistvarno);
  
  	/*
+ 	 * Initialize recheck plan.
+ 	 */
+ 	scanstate->fdw_recheck_plan = ExecInitNode(node->fdw_recheck_plan,
+ 											   estate, eflags);
+ 
+ 	/*
  	 * Initialize FDW-related state.
  	 */
  	scanstate->fdwroutine = fdwroutine;
***************
*** 235,240 **** ExecEndForeignScan(ForeignScanState *node)
--- 259,267 ----
  	/* close the relation. */
  	if (node->ss.ss_currentRelation)
  		ExecCloseScanRelation(node->ss.ss_currentRelation);
+ 
+ 	/* shut down recheck plan. */
+ 	ExecEndNode(node->fdw_recheck_plan);
  }
  
  /* ----------------------------------------------------------------
***************
*** 249,252 **** ExecReScanForeignScan(ForeignScanState *node)
--- 276,296 ----
  	node->fdwroutine->ReScanForeignScan(node);
  
  	ExecScanReScan(&node->ss);
+ 
+ 	if (node->fdw_recheck_plan != NULL)
+ 	{
+ 		/*
+ 		 * set chgParam for recheck plan
+ 		 */
+ 		if (((PlanState *) node)->chgParam != NULL)
+ 			UpdateChangedParamSet(node->fdw_recheck_plan,
+ 								  ((PlanState *) node)->chgParam);
+ 
+ 		/*
+ 		 * if chgParam of recheck plan is not null then the plan will be
+ 		 * re-scanned by first ExecProcNode.
+ 		 */
+ 		if (node->fdw_recheck_plan->chgParam == NULL)
+ 			ExecReScan(node->fdw_recheck_plan);
+ 	}
  }
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 648,653 **** _copyForeignScan(const ForeignScan *from)
--- 648,654 ----
  	COPY_NODE_FIELD(fdw_exprs);
  	COPY_NODE_FIELD(fdw_private);
  	COPY_NODE_FIELD(fdw_scan_tlist);
+ 	COPY_NODE_FIELD(fdw_recheck_plan);
  	COPY_NODE_FIELD(fdw_recheck_quals);
  	COPY_BITMAPSET_FIELD(fs_relids);
  	COPY_SCALAR_FIELD(fsSystemCol);
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 594,599 **** _outForeignScan(StringInfo str, const ForeignScan *node)
--- 594,600 ----
  	WRITE_NODE_FIELD(fdw_exprs);
  	WRITE_NODE_FIELD(fdw_private);
  	WRITE_NODE_FIELD(fdw_scan_tlist);
+ 	WRITE_NODE_FIELD(fdw_recheck_plan);
  	WRITE_NODE_FIELD(fdw_recheck_quals);
  	WRITE_BITMAPSET_FIELD(fs_relids);
  	WRITE_BOOL_FIELD(fsSystemCol);
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
***************
*** 1798,1803 **** _readForeignScan(void)
--- 1798,1804 ----
  	READ_NODE_FIELD(fdw_exprs);
  	READ_NODE_FIELD(fdw_private);
  	READ_NODE_FIELD(fdw_scan_tlist);
+ 	READ_NODE_FIELD(fdw_recheck_plan);
  	READ_NODE_FIELD(fdw_recheck_quals);
  	READ_BITMAPSET_FIELD(fs_relids);
  	READ_BOOL_FIELD(fsSystemCol);
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 2141,2146 **** create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
--- 2141,2157 ----
  	scan_plan->fs_relids = best_path->path.parent->relids;
  
  	/*
+ 	 * If we're scanning a join relation, generate a recheck plan for
+ 	 * EvalPlanQual support.  (Irrelevant if scanning a base relation.)
+ 	 */
+ 	if (scan_relid == 0)
+ 	{
+ 		scan_plan->fdw_recheck_plan =
+ 			create_plan_recurse(root, best_path->fdw_recheck_path);
+ 		scan_plan->fdw_recheck_plan->targetlist = scan_plan->fdw_scan_tlist;
+ 	}
+ 
+ 	/*
  	 * Replace any outer-relation variables with nestloop params in the qual
  	 * and fdw_exprs expressions.  We do this last so that the FDW doesn't
  	 * have to be involved.  (Note that parts of fdw_exprs could have come
***************
*** 3758,3763 **** make_foreignscan(List *qptlist,
--- 3769,3776 ----
  	node->fdw_exprs = fdw_exprs;
  	node->fdw_private = fdw_private;
  	node->fdw_scan_tlist = fdw_scan_tlist;
+ 	/* fdw_recheck_plan will be filled in by create_foreignscan_plan */
+ 	node->fdw_recheck_plan = NULL;
  	node->fdw_recheck_quals = fdw_recheck_quals;
  	/* fs_relids will be filled in by create_foreignscan_plan */
  	node->fs_relids = NULL;
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
***************
*** 1130,1135 **** set_foreignscan_references(PlannerInfo *root,
--- 1130,1137 ----
  		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
  		fscan->fdw_scan_tlist =
  			fix_scan_list(root, fscan->fdw_scan_tlist, rtoffset);
+ 		/* fdw_recheck_plan needs set_plan_refs() adjustments */
+ 		set_plan_refs(root, fscan->fdw_recheck_plan, rtoffset);
  	}
  	else
  	{
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
***************
*** 2405,2410 **** finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
--- 2405,2419 ----
  				/* We assume fdw_scan_tlist cannot contain Params */
  				context.paramids = bms_add_members(context.paramids,
  												   scan_params);
+ 
+ 				/* recheck plan if foreign join */
+ 				if (fscan->scan.scanrelid == 0)
+ 					context.paramids =
+ 						bms_add_members(context.paramids,
+ 										finalize_plan(root,
+ 													  fscan->fdw_recheck_plan,
+ 													  valid_params,
+ 													  scan_params));
  			}
  			break;
  
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***************
*** 1488,1493 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1488,1494 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *fdw_recheck_path,
  						List *fdw_private)
  {
  	ForeignPath *pathnode = makeNode(ForeignPath);
***************
*** 1501,1506 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1502,1508 ----
  	pathnode->path.total_cost = total_cost;
  	pathnode->path.pathkeys = pathkeys;
  
+ 	pathnode->fdw_recheck_path = fdw_recheck_path;
  	pathnode->fdw_private = fdw_private;
  
  	return pathnode;
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1579,1584 **** typedef struct WorkTableScanState
--- 1579,1585 ----
  typedef struct ForeignScanState
  {
  	ScanState	ss;				/* its first field is NodeTag */
+ 	PlanState  *fdw_recheck_plan;	/* local join execution plan */
  	List	   *fdw_recheck_quals;	/* original quals not in ss.ps.qual */
  	/* use struct pointer to avoid including fdwapi.h here */
  	struct FdwRoutine *fdwroutine;
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 529,534 **** typedef struct ForeignScan
--- 529,535 ----
  	List	   *fdw_exprs;		/* expressions that FDW may evaluate */
  	List	   *fdw_private;	/* private data for FDW */
  	List	   *fdw_scan_tlist; /* optional tlist describing scan tuple */
+ 	Plan	   *fdw_recheck_plan;	/* local join execution plan */
  	List	   *fdw_recheck_quals;	/* original quals not in scan.plan.quals */
  	Bitmapset  *fs_relids;		/* RTIs generated by this scan */
  	bool		fsSystemCol;	/* true if any "system column" is needed */
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
***************
*** 903,912 **** typedef struct TidPath
--- 903,916 ----
   * generally a good idea to use a representation that can be dumped by
   * nodeToString(), so that you can examine the structure during debugging
   * with tools like pprint().
+  *
+  * If a ForeignPath node represents a remote join, then fdw_recheck_path is
+  * a local join execution path for use in EvalPlanQual.  (Else it is NULL.)
   */
  typedef struct ForeignPath
  {
  	Path		path;
+ 	Path	   *fdw_recheck_path;
  	List	   *fdw_private;
  } ForeignPath;
  
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
***************
*** 86,91 **** extern ForeignPath *create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 86,92 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *fdw_recheck_path,
  						List *fdw_private);
  
  extern Relids calc_nestloop_required_outer(Path *outer_path, Path *inner_path);

#134

Jeevan Chalke

jeevan.chalke@enterprisedb.com

about 10 years ago

In reply to: Robert Haas (#130)

Re: Foreign join pushdown vs EvalPlanQual

On Thu, Oct 15, 2015 at 10:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Oct 15, 2015 at 3:04 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I confirmed that an epqtuple of foreign parameterized scan is
correctly rejected by fdw_recheck_quals with modified outer
tuple.

I have no objection to this and have two humble comments.

In file_fdw.c, the comment for the last parameter just after the
added line seems to be better to be aligned with other comments.

I've pgindented the file. Any other space we might choose would just
be changed by the next pgindent run, so there's no point in trying to
vary.

In subselect.c, the added break is in the added curly-braces but
it would be better to place it after the closing brace, like the
other cases.

Changed that, and committed.

With the latest sources, which include this commit, I get the following
error when I follow the same steps:
ERROR: unrecognized node type: 525

It looks like we have missed handling T_RestrictInfo.
I am getting this error from expression_tree_mutator().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

#135

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Etsuro Fujita (#132)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/15 11:36, Kouhei Kaigai wrote:

In case of scanrelid==0, the expectation for ForeignScan/CustomScan is to
behave as if a local join existed here. It requires ForeignScan to generate
a joined tuple as the result of the remote join, which may contain multiple junk
TLEs to carry whole-row Var references of the base foreign tables.
Given that criterion, the desirable behavior is clear, as below:

1. FDW/CSP picks up the base relations' tuples from the EPQ slots.
These are set up from the whole-row reference under early row-lock semantics,
or by RefetchForeignRow under late row-lock semantics.

2. Fill up ss_ScanTupleSlot according to the xxx_scan_tlist.
We may be able to provide a common support function here, because this
list records the relation between each attribute of the joined tuple
and its source column.

3. Apply the join clauses and base restrictions that were pushed down.
setrefs.c initializes the expressions kept in fdw_exprs/custom_exprs to run
on the ss_ScanTupleSlot. That is the easiest way to check here.

4. If the joined tuple is still visible after step 3, FDW/CSP returns the
joined tuple. Otherwise, it returns an empty slot.

This behavior is entirely compatible with a local join located at
the point of the ForeignScan/CustomScan with scanrelid==0.

Even if the remote join is parameterized by another relation, we can simply
use the param-info delivered from the corresponding outer scan at step 3.
EState should have the parameters already updated, so the FDW driver needs to
care about nothing.

It is a much less invasive approach with respect to the existing EPQ recheck
mechanism.
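
The four steps above can be sketched as a small standalone C function. This is only an illustration under stated assumptions: BaseTuple, JoinedTuple and recheck_remote_join are hypothetical names, not PostgreSQL types or APIs, the join is assumed to be an inner join, and the pushed-down clause foo.a = bar.a is borrowed from the example at the top of the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are PostgreSQL types. */
typedef struct BaseTuple { int a; } BaseTuple;              /* one base relation's row */
typedef struct JoinedTuple { int foo_a; int bar_a; } JoinedTuple;

/*
 * Rebuild and recheck a joined tuple from the base tuples kept in the
 * (simulated) EPQ slots. Returns true and fills *out if the joined tuple
 * still satisfies the pushed-down inner-join clause foo.a = bar.a.
 */
static bool
recheck_remote_join(const BaseTuple *epq_foo, const BaseTuple *epq_bar,
                    JoinedTuple *out)
{
    assert(out != NULL);

    /* Step 1: pick up the base tuples; without both, no join row exists. */
    if (epq_foo == NULL || epq_bar == NULL)
        return false;

    /* Step 2: fill the scan tuple from the source columns. */
    out->foo_a = epq_foo->a;
    out->bar_a = epq_bar->a;

    /* Step 3: apply the pushed-down join clause. */
    if (out->foo_a != out->bar_a)
        return false;

    /* Step 4: the joined tuple is still valid, so return it. */
    return true;
}
```

A caller that gets false would return an empty slot. An outer join would need extra handling for the NULL-extended side, which this sketch leaves out.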

I wrote:

I see. That's an idea, but I guess that step 2 and 3 would need to add
a lot of code to the core. Why don't you use a local join execution
plan that we discussed? I think that that would make the series of
processing much simpler. I'm now revising the patch that I created for
that. If it's okay, I'd like to propose an updated version of the patch
in a few days.

On 2015/10/15 20:19, Kouhei Kaigai wrote:

I have to explain why the above idea is simpler and suitable for the v9.5
timeline.
As I've consistently proposed for these two months, steps 2 and 3
are assumed to be handled in the callback routine invoked from
ForeignRecheck().

Honestly, I still don't see much value in doing so.
As Robert mentioned in [1], I think that if we're inside EPQ,
pushed-down quals and/or pushed-down joins should be locally rechecked
in the same way as other cases such as IndexRecheck. So, I'll propose
the updated version of the patch.

You have not answered my question for two months.

I have never objected to executing the pushed-down qualifiers locally.
It is likely the best tactic in most cases.
But why do you try to force one particular approach on everyone?

There are various kinds of FDW drivers. How do you guarantee it is
the best solution for all of them? It is basically impossible.
(Please google "Probatio diabolica")

You are trying to add two special-purpose fields to ForeignScan:
fdw_recheck_plan and fdw_recheck_quals.
That requires FDW drivers to keep pushed-down qualifiers in a particular
data format, and also requires FDW drivers to process the EPQ recheck with
an alternative local plan, even if some FDW drivers can handle
these jobs better with their own implementation.

I've repeatedly pointed out this issue, but have never gotten a reasonable
answer from you.

Again, I admit that an alternative plan may be a reasonable tactic for
most FDW drivers. However, only the FDW author can "decide" what is
the best tactic for handling the task in their module, not us.

I don't think it is good interface design to force people to
follow a particular implementation approach. It should be at the discretion
of the extension.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#136

Jeevan Chalke

jeevan.chalke@enterprisedb.com

about 10 years ago

In reply to: Jeevan Chalke (#134)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Oct 16, 2015 at 3:10 PM, Jeevan Chalke <
jeevan.chalke@enterprisedb.com> wrote:

On Thu, Oct 15, 2015 at 10:44 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Thu, Oct 15, 2015 at 3:04 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I confirmed that an epqtuple of foreign parameterized scan is
correctly rejected by fdw_recheck_quals with modified outer
tuple.

I have no objection to this and have two humble comments.

In file_fdw.c, the comment for the last parameter just after the
added line seems to be better to be aligned with other comments.

I've pgindented the file. Any other space we might choose would just
be changed by the next pgindent run, so there's no point in trying to
vary.

In subselect.c, the added break is in the added curly-braces but
it would be better to place it after the closing brace, like the
other cases.

Changed that, and committed.

With the latest sources, which include this commit, I get the following
error when I follow the same steps:
ERROR: unrecognized node type: 525

It looks like we have missed handling T_RestrictInfo.
I am getting this error from expression_tree_mutator().

Ignore this.
It was caused by a compilation issue on my system.

It is working as expected in the latest sources.

Sorry for the noise and inconvenience caused.

--
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

#137

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Etsuro Fujita (#133)

Re: Foreign join pushdown vs EvalPlanQual

I briefly browsed the patch apart from my preference towards the approach.

It has at least two oversights.

*** 48,59 **** ExecScanFetch(ScanState *node,
+ 		/*
+ 		 * Execute recheck plan and get the next tuple if foreign join.
+ 		 */
+ 		if (scanrelid == 0)
+ 		{
+ 			(*recheckMtd) (node, slot);
+ 			return slot;
+ 		}

Ensure the slot is empty if recheckMtd returned false, as the base relation
case does.
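
A minimal sketch of that first point, assuming a toy Slot type and a recheck callback (Slot, RecheckFn, fetch_with_recheck and recheck_positive are hypothetical names, not executor APIs). It only shows the shape of the fix: when the recheck fails, the slot is marked empty before it is returned, which is what the base relation path does by clearing the slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a tuple slot. */
typedef struct Slot
{
    bool empty;   /* true when the slot holds no tuple */
    int  value;   /* payload of the stored tuple */
} Slot;

typedef bool (*RecheckFn)(Slot *slot);

/*
 * Run the recheck callback and make sure a failed recheck leaves the
 * slot empty, so the caller never sees a tuple that did not pass.
 */
static Slot *
fetch_with_recheck(Slot *slot, RecheckFn recheck)
{
    assert(slot != NULL && recheck != NULL);

    if (!recheck(slot))
        slot->empty = true;   /* tuple would not be returned by the scan */

    return slot;
}

/* Example qual for the sketch: accept only positive values. */
static bool
recheck_positive(Slot *slot)
{
    return slot->value > 0;
}
```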

*** 347,352 **** ExecScanReScan(ScanState *node)
{
Index scanrelid = ((Scan *) node->ps.plan)->scanrelid;

+ 		if (scanrelid == 0)
+ 			return;				/* nothing to do */
+ 
  		Assert(scanrelid > 0);

estate->es_epqScanDone[scanrelid - 1] = false;

Why is there nothing to do?
The base relations managed by a ForeignScan are tracked in the fs_relids bitmap.
As you pointed out a few days ago, if the ForeignScan is a parameterized
remote join, the EPQ slots contain invalid tuples based on the old outer tuple.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

-----Original Message-----
From: Etsuro Fujita [mailto:fujita.etsuro@lab.ntt.co.jp]
Sent: Friday, October 16, 2015 6:01 PM
To: Robert Haas
Cc: Kyotaro HORIGUCHI; Kaigai Kouhei(海外浩平); pgsql-hackers@postgresql.org;
Shigeru Hanada
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/10/14 17:31, Etsuro Fujita wrote:

As KaiGai-san also pointed out before, I think we should address this in
each of the following cases:

1) remote qual (scanrelid>0)
2) remote join (scanrelid==0)

As for #2, I updated the patch, which uses a local join execution plan
for an EvalPlanQual recheck, according to the comment from Robert [1].
Attached is an updated version of the patch. This is a WIP patch, but
it would be appreciated if I could get feedback earlier.

For tests, apply the patches:

foreign-recheck-for-foreign-join-1.patch
usermapping_matching.patch [2]
add_GetUserMappingById.patch [2]
foreign_join_v16_efujita.patch [3]

Since, as I said upthread, what I'd like to discuss is changes to
the PG core, I didn't do anything about the postgres_fdw patches.

Best regards,
Etsuro Fujita

[1] /messages/by-id/CA+TgmoaAzs0dR23R7PTBseQfwOtuVCPNBqDHxeBo9Gi+dMxj8w@mail.gmail.com
[2] /messages/by-id/CAEZqfEe9KGy=1_waGh2rgZPg0o4pqgD+iauYaj8wTze+CYJUHg@mail.gmail.com
[3] /messages/by-id/55CB2D45.7040100@lab.ntt.co.jp


#138

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Etsuro Fujita (#133)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Oct 16, 2015 at 5:00 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

As for #2, I updated the patch, which uses a local join execution plan for
an EvalPlanQual recheck, according to the comment from Robert [1]. Attached
is an updated version of the patch. This is a WIP patch, but it would be
appreciated if I could get feedback earlier.

I don't see how this can be right. You're basically just pretending
EPQ doesn't exist in the remote join case, which isn't going to work
at all. Those bits of code that look at es_epqTuple, es_epqTupleSet,
and es_epqScanDone are not optional. You can't just skip over those
as if they don't matter.

Again, the root of the problem here is that the EPQ machinery provides
1 slot per RTI, and it uses scanrelid to figure out which EPQ slot is
applicable for a given scan node. Here, we have scanrelid == 0, so it
gets confused. But it's not the case that a pushed-down join has NO
scanrelid. It actually has, in effect, *multiple* scanrelids. So we
should pick any one of those, say the lowest-numbered one, and use
that to decide which EPQ slot to use. The attached patch shows
roughly what I have in mind, although this is just crap code to
demonstrate the basic idea and doesn't pretend to adjust everything
that needs fixing.
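
The proxy idea can be illustrated in isolation with a standalone sketch. It assumes the relid set is a plain 64-bit mask instead of a Bitmapset, and lowest_relid and proxy_scanrelid are hypothetical names. The lookup here does not modify the set; in PostgreSQL itself bms_first_member() removes the member it returns, so a real patch would presumably want a non-destructive lookup such as bms_next_member().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the lowest 1-based range-table index present in relids, or 0 if
 * the set is empty. Bit (rti - 1) represents RTI rti. The set is only read.
 */
static unsigned
lowest_relid(uint64_t relids)
{
    for (unsigned rti = 1; rti <= 64; rti++)
    {
        if (relids & (UINT64_C(1) << (rti - 1)))
            return rti;
    }
    return 0;
}

/*
 * Pick the EPQ slot index for a scan node: its own scanrelid when it has
 * one, otherwise the lowest-numbered relation covered by the pushed-down
 * join.
 */
static unsigned
proxy_scanrelid(unsigned scanrelid, uint64_t relids)
{
    unsigned proxy;

    if (scanrelid != 0)
        return scanrelid;

    proxy = lowest_relid(relids);
    assert(proxy > 0);   /* a pushed-down join must cover some relation */
    return proxy;
}
```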

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

stupid-epq-tricks.patch (application/x-patch)

diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index a96e826..5c4a4f4 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -26,6 +26,32 @@
 static bool tlist_matches_tupdesc(PlanState *ps, List *tlist, Index varno, TupleDesc tupdesc);
 
 
+static Index
+get_proxy_scanrelid(ScanState *node)
+{
+	Scan	*scan = (Scan *) node->ps.plan;
+
+	Assert(scan->scanrelid == 0);
+
+	switch (nodeTag(scan))
+	{
+		case T_ForeignScan:
+			{
+				ForeignScan *fs = (ForeignScan *) scan;
+				return bms_first_member(fs->fs_relids);
+			}
+
+		case T_CustomScan:
+			{
+				CustomScan *cs = (CustomScan *) scan;
+				return bms_first_member(cs->custom_relids);
+			}
+
+		default:
+			elog(FATAL, "unexpected node type: %d", (int) nodeTag(scan));
+	}
+}
+
 /*
  * ExecScanFetch -- fetch next potential tuple
  *
@@ -49,6 +75,8 @@ ExecScanFetch(ScanState *node,
 		 */
 		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;
 
+		if (scanrelid == 0)
+			scanrelid = get_proxy_scanrelid(node);
 		Assert(scanrelid > 0);
 		if (estate->es_epqTupleSet[scanrelid - 1])
 		{
@@ -347,6 +375,8 @@ ExecScanReScan(ScanState *node)
 	{
 		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;
 
+		if (scanrelid == 0)
+			scanrelid = get_proxy_scanrelid(node);
 		Assert(scanrelid > 0);
 
 		estate->es_epqScanDone[scanrelid - 1] = false;

#139

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Robert Haas (#138)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Oct 16, 2015 at 5:00 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

As for #2, I updated the patch, which uses a local join execution plan for
an EvalPlanQual recheck, according to the comment from Robert [1]. Attached
is an updated version of the patch. This is a WIP patch, but it would be
appreciated if I could get feedback earlier.

I don't see how this can be right. You're basically just pretending
EPQ doesn't exist in the remote join case, which isn't going to work
at all. Those bits of code that look at es_epqTuple, es_epqTupleSet,
and es_epqScanDone are not optional. You can't just skip over those
as if they don't matter.

I think it is the right approach to pretend EPQ doesn't exist if scanrelid==0,
because what these ForeignScan/CustomScan nodes replace are local join
nodes like NestLoop. Those don't have their own EPQ slot, but construct a joined
tuple based on the underlying scan tuples originally stored in the EPQ slots.

Again, the root of the problem here is that the EPQ machinery provides
1 slot per RTI, and it uses scanrelid to figure out which EPQ slot is
applicable for a given scan node. Here, we have scanrelid == 0, so it
gets confused. But it's not the case that a pushed-down join has NO
scanrelid. It actually has, in effect, *multiple* scanrelids. So we
should pick any one of those, say the lowest-numbered one, and use
that to decide which EPQ slot to use. The attached patch shows
roughly what I have in mind, although this is just crap code to
demonstrate the basic idea and doesn't pretend to adjust everything
that needs fixing.

One tricky point of this idea is the ExecStoreTuple() in ExecScanFetch():
the EPQ slot picked by get_proxy_scanrelid() contains
a tuple of a base relation, and the code then tries to put this tuple into the
TupleTableSlot initialized to hold the joined tuple.
Of course, recheckMtd is called soon afterwards, so the callback will be able to
handle the request correctly. However, it is a bit unnatural to store
a tuple in an incompatible TupleTableSlot.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#140

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Kouhei Kaigai (#139)

Re: Foreign join pushdown vs EvalPlanQual

Kouhei Kaigai <kaigai@ak.jp.nec.com> writes:

On Fri, Oct 16, 2015 at 5:00 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
I don't see how this can be right. You're basically just pretending
EPQ doesn't exist in the remote join case, which isn't going to work
at all. Those bits of code that look at es_epqTuple, es_epqTupleSet,
and es_epqScanDone are not optional. You can't just skip over those
as if they don't matter.

I think it is the right approach to pretend EPQ doesn't exist if scanrelid==0,
because what these ForeignScan/CustomScan nodes replace are local join
nodes like NestLoop.

That's just nonsense. The reason that nestloop doesn't need its own EPQ
slot is that what it will be joining is tuples provided by scan nodes,
and it was the scan nodes that took care of fetching correct,
updated-if-need-be tuples for the EPQ check. You can't just discard that
responsibility when you're implementing a join remotely ... at least not
if you want to preserve semantics similar to what happens with local
tables.

Or maybe I misunderstood what you meant, but it's certainly not OK to say
that EPQ is a no-op for a pushed-down join. Rather, it has to perform all
the same checks that would have happened for any of its constituent
tables.

regards, tom lane


#141

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Tom Lane (#140)

Re: Foreign join pushdown vs EvalPlanQual

Kouhei Kaigai <kaigai@ak.jp.nec.com> writes:

On Fri, Oct 16, 2015 at 5:00 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
I don't see how this can be right. You're basically just pretending
EPQ doesn't exist in the remote join case, which isn't going to work
at all. Those bits of code that look at es_epqTuple, es_epqTupleSet,
and es_epqScanDone are not optional. You can't just skip over those
as if they don't matter.

I think it is the right approach to pretend EPQ doesn't exist if scanrelid==0,
because what these ForeignScan/CustomScan nodes replace are local join
nodes like NestLoop.

That's just nonsense. The reason that nestloop doesn't need its own EPQ
slot is that what it will be joining is tuples provided by scan nodes,
and it was the scan nodes that took care of fetching correct,
updated-if-need-be tuples for the EPQ check. You can't just discard that
responsibility when you're implementing a join remotely ... at least not
if you want to preserve semantics similar to what happens with local
tables.

NestLoop itself does not need its own EPQ slot, no doubt. However, the
entire sub-tree under a NestLoop uses at least two underlying EPQ slots,
those of the base relations to be joined.

My opinion is, simply, that a ForeignScan/CustomScan with scanrelid==0 takes
over the responsibility for the EPQ recheck of the entire join sub-tree that is
replaced by the ForeignScan/CustomScan node.
If a ForeignScan runs a remote join on foreign tables A and B, it shall
apply both the scan quals and the join clauses to the tuples kept in
the EPQ slots, in some fashion depending on the FDW implementation.

Nobody disputes which checks shall be applied here. The EPQ recheck
shall be applied as if the entire join sub-tree existed here.
The major difference between me and Fujita-san is how to recheck it.
I think the FDW knows the best way to do it (even if we can provide utility
routines for the majority of cases); however, Fujita-san says one particular
implementation is the best for everyone. I cannot agree with
his opinion.

Or maybe I misunderstood what you meant, but it's certainly not OK to say
that EPQ is a no-op for a pushed-down join. Rather, it has to perform all
the same checks that would have happened for any of its constituent
tables.

I've never said that EPQ should be a no-op for a pushed-down join.
The responsibility for the entire join sub-tree is not discarded, and
not changed, even if it is replaced by a remote join.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#142

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Kouhei Kaigai (#139)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Oct 16, 2015 at 6:12 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

I think it is the right approach to pretend EPQ doesn't exist if scanrelid==0,
because what these ForeignScan/CustomScan nodes replace are local join
nodes like NestLoop. Those don't have their own EPQ slot, but construct a joined
tuple based on the underlying scan tuples originally stored in the EPQ slots.

I think you've got that backwards. The fact that they don't have
their own EPQ slot is the problem we need to solve. When an EPQ
recheck happens, we rescan every relation in the query. Each relation
needs to return 0 or 1 tuples. If it returns a tuple, the tuple it
returns must be either the same tuple it previously returned, or an
updated version of that tuple. But "the tuple it previously returned"
does not necessarily mean the tuple it returned most recently. It
means the tuple that it returned which, when passed through the rest
of the plan, contributed to generate the result tuple that is being
rechecked.

Now, if you don't have an EPQ slot, how are you going to do this?
When the EPQ machinery engages, you need to somehow get the tuple you
previously returned stored someplace. And the first time thereafter
that you get called by ExecProcNode, you need to return that tuple,
provided that it still passes the quals. The second time you get
called, and any subsequent times, you need to return an empty slot.
The EPQ slot is well-suited to this task. It's got a TupleTableSlot
to store the tuple you need to return, and it's got a flag indicating
whether you've already returned that tuple. So you're good.
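
The protocol described here (return the stored tuple once if it still passes the quals, then return nothing until the next rescan) can be sketched as a tiny state machine. The names EpqSlot, epq_fetch, epq_rescan and qual_is_one are hypothetical, and the tuple is reduced to an int for brevity. One simplification to note: when no test tuple is set, the real executor falls back to the normal access method, whereas this sketch just returns NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one EPQ slot: a stored tuple plus two flags. */
typedef struct EpqSlot
{
    bool tuple_set;   /* a test tuple has been stored for this relation */
    bool scan_done;   /* the tuple was already handed out in this recheck */
    int  tuple;       /* the stored test tuple (an int, for brevity) */
} EpqSlot;

typedef bool (*QualFn)(int tuple);

/*
 * First call after a rescan: return the stored tuple if it passes the
 * qual. Every later call: return NULL, meaning "empty slot".
 */
static const int *
epq_fetch(EpqSlot *slot, QualFn qual)
{
    assert(slot != NULL && qual != NULL);

    if (!slot->tuple_set || slot->scan_done)
        return NULL;

    slot->scan_done = true;   /* never hand the tuple out twice */
    return qual(slot->tuple) ? &slot->tuple : NULL;
}

/* Start a new pass over the same test tuple. */
static void
epq_rescan(EpqSlot *slot)
{
    slot->scan_done = false;
}

/* Example qual for the sketch: a = 1. */
static bool
qual_is_one(int tuple)
{
    return tuple == 1;
}
```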

But with Etsuro Fujita's patch, and I think what you have proposed has
been similar, how are you going to do it? The proposal is to call the
recheck method and hope for the best, but what is the recheck method
going to do? Where is it going to get the previously-returned tuple?
How will it know if it has already returned it during the lifetime of
this EPQ check? Offhand, it looks to me like, at least in some
circumstances, you're probably going to return whatever tuple you
returned most recently (which has a good chance of being the right
one, but not necessarily) over and over again. That's not going to
fly.

The bottom line is that a foreign scan that is a pushed-down join is
still a *scan*, and every already-existing scan type has an EPQ slot
*for a reason*. They *need* it in order to deliver the correct
behavior. And foreign scans and custom scans need it too, and for the
same reason.

Again, the root of the problem here is that the EPQ machinery provides
1 slot per RTI, and it uses scanrelid to figure out which EPQ slot is
applicable for a given scan node. Here, we have scanrelid == 0, so it
gets confused. But it's not the case that a pushed-down join has NO
scanrelid. It actually has, in effect, *multiple* scanrelids. So we
should pick any one of those, say the lowest-numbered one, and use
that to decide which EPQ slot to use. The attached patch shows
roughly what I have in mind, although this is just crap code to
demonstrate the basic idea and doesn't pretend to adjust everything
that needs fixing.

One tricky point of this idea is the ExecStoreTuple() in ExecScanFetch():
the EPQ slot picked by get_proxy_scanrelid() contains
a tuple of a base relation, and the code then tries to put this tuple into the
TupleTableSlot initialized to hold the joined tuple.
Of course, recheckMtd is called soon afterwards, so the callback will be able to
handle the request correctly. However, it is a bit unnatural to store
a tuple in an incompatible TupleTableSlot.

I think that the TupleTableSlot is incompatible because the dummy
patch I posted only addresses half of the problem. I didn't do
anything about the code that stores stuff into an EPQ slot. If that
were also fixed, then the tuple which the patch retrieves from the
slot would not be incompatible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#143

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Kouhei Kaigai (#141)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Oct 16, 2015 at 7:48 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

My opinion is, simply, that a ForeignScan/CustomScan with scanrelid==0 takes
over the responsibility for the EPQ recheck of the entire join sub-tree that is
replaced by the ForeignScan/CustomScan node.
If a ForeignScan runs a remote join on foreign tables A and B, it shall
apply both the scan quals and the join clauses to the tuples kept in
the EPQ slots, in some fashion depending on the FDW implementation.

And my opinion, as I said before, is that's completely wrong. The
ForeignScan which represents a pushed-down join is a *scan*. In
general, scans have one EPQ slot, and that is the right number. This
pushed-down join scan, though, is in a state of confusion. The code
that populates the EPQ slots thinks it's got multiple slots, one per
underlying relation. Meanwhile, the code that reads data back out of
those slots thinks it doesn't have any slots at all. Both of those
pieces of code are wrong. This foreign scan, like any other scan,
should use ONE slot.

Both you and Etsuro Fujita are proposing to fix this problem by
somehow making it the FDW's problem to reconstruct the tuple
previously produced by the join from whole-row images of the baserels.
But that's not looking back far enough: why are we asking for
whole-row images of the baserels when what we really want is a
whole-row image of the output of the join? The output of the join is
what we need to re-return.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#144

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Robert Haas (#143)

Re: Foreign join pushdown vs EvalPlanQual

Robert Haas <robertmhaas@gmail.com> writes:

Both you and Etsuro Fujita are proposing to fix this problem by
somehow making it the FDW's problem to reconstruct the tuple
previously produced by the join from whole-row images of the baserels.
But that's not looking back far enough: why are we asking for
whole-row images of the baserels when what we really want is a
whole-row image of the output of the join? The output of the join is
what we need to re-return.

There are multiple components to the requirement though:

1. Recheck the rows that were in the baserels and possibly fetch updated
versions of them. (Once we're in EPQ, we want the most recent row
versions, not what the query snapshot can see.)

2. Apply relevant restriction clauses and see if the updated row versions
still pass the clauses.

3. If so, form a join row and return that. Else return NULL.

One way or another, the FDW has to do all of the above, or as much of it
as it can't fob off on the remote server, once it's decided to bypass
local implementation of the join. Just recomputing the original join
row is *not* good enough.

I think what Kaigai-san and Etsuro-san are after is trying to find a way
to reuse some of the existing EPQ machinery to help with that. This may
not be practical, or it may end up being messier than a standalone
implementation; but it's not silly on its face to want to reuse some of
that code.

regards, tom lane


#145

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Tom Lane (#144)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Oct 16, 2015 at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Both you and Etsuro Fujita are proposing to fix this problem by
somehow making it the FDW's problem to reconstruct the tuple
previously produced by the join from whole-row images of the baserels.
But that's not looking back far enough: why are we asking for
whole-row images of the baserels when what we really want is a
whole-row image of the output of the join? The output of the join is
what we need to re-return.

There are multiple components to the requirement though:

1. Recheck the rows that were in the baserels and possibly fetch updated
versions of them. (Once we're in EPQ, we want the most recent row
versions, not what the query snapshot can see.)

Check. But postgres_fdw, and probably quite a few other FDWs, use
early row locking. So ROW_MARK_COPY is in use and we need not worry
about refetching rows.

2. Apply relevant restriction clauses and see if the updated row versions
still pass the clauses.

Check.

3. If so, form a join row and return that. Else return NULL.

Not check.

Suppose we've got two foreign tables ft1 and ft2, using postgres_fdw.
There is a local table t. The user does something like UPDATE t SET
... FROM ft1, ft2 WHERE t = ft1.a AND ft1.b = ft2.b AND .... The
query planner generates something like:

Update
-> Join
-> Scan on t
-> Foreign Scan on <ft1, ft2>

If an EPQ recheck occurs, the only thing that matters is that the
Foreign Scan return the right output row (or possibly no rows, if the
row it would have formed no longer matches the quals). It doesn't
matter how it does this. Let's say the columns actually needed by the
query from the ft1-ft2 join are ft1.a, ft1.b, ft2.a, and ft2.b.
Currently, the output of the foreign scan is something like: ft1.a,
ft1.b, ft2.a, ft2.b, ft1.*, ft2.*. The EPQ recheck has access to ft1.*
and ft2.*, but it's not straightforward for postgres_fdw to regenerate
the join tuple from that. Maybe the pushed-down join was a left join,
maybe it was a right join, maybe it was a full join. So some of the
columns could have gone to NULL. To figure it out, you need to build
a secondary plan tree that mimics the structure of the join you pushed
down, which is kinda hairy.
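
To make the join-type dependence concrete, here is a minimal sketch of rebuilding a join row from the two whole-row images. Everything in it is hypothetical (Row, NullableRow, JoinRow, JoinKind, rebuild_join_row); it assumes the join clause ft1.b = ft2.b from the example and covers only inner and left joins. The same pair of base rows yields no row under an inner join but a NULL-extended row under a left join, so the rebuild cannot be done without knowing the shape of the pushed-down join.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical row types for the sketch. */
typedef struct Row { int a; int b; } Row;
typedef struct NullableRow { bool is_null; Row row; } NullableRow;
typedef struct JoinRow { NullableRow ft1; NullableRow ft2; } JoinRow;

typedef enum { SKETCH_JOIN_INNER, SKETCH_JOIN_LEFT } JoinKind;

/*
 * Rebuild the join row for "ft1 JOIN ft2 ON ft1.b = ft2.b" from whole-row
 * images of the base relations. Returns false when the join produces no
 * row. The result depends on the join kind, not only on the base rows.
 */
static bool
rebuild_join_row(JoinKind kind, const Row *ft1, const Row *ft2, JoinRow *out)
{
    bool match;

    assert(out != NULL);

    /* ft1 is the preserved side of the left join; without it, no row. */
    if (ft1 == NULL)
        return false;

    match = (ft2 != NULL && ft1->b == ft2->b);
    if (!match && kind == SKETCH_JOIN_INNER)
        return false;

    out->ft1.is_null = false;
    out->ft1.row = *ft1;

    /* For a left join without a match, the ft2 columns go to NULL. */
    out->ft2.is_null = !match;
    out->ft2.row = match ? *ft2 : (Row){0, 0};
    return true;
}
```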

Think how much easier your life would be if you hadn't bothered
fetching ft1.* and ft2.*, which aren't so handy in this case, and had
instead made the output of the foreign scan ft1.a, ft1.b, ft2.a,
ft2.b, ROW(ft1.a, ft1.b, ft2.a, ft2.b) -- and that the output of that
ROW() operation was stored in an EPQ slot. Now, you don't need the
secondary plan tree any more. You've got all the data you need right
in your hand. The values inside the ROW() constructor were evaluated
after accounting for the goes-to-NULL effects of any pushed-down
joins.

This example is of the early row locking case, but I think the story
is about the same if the FDW wants to do late row locking instead. If
there's an EPQ recheck, it could issue individual row re-fetches
against every base table and then re-do all the joins that it pushed
down locally. But it would be faster and cleaner, I think, to send
one query to the remote side that re-fetches all the rows at once, and
whose target list is exactly what we need, rather than whole row
targetlists for each baserel that then have to be rejiggered on our
side.

I think what Kaigai-san and Etsuro-san are after is trying to find a way
to reuse some of the existing EPQ machinery to help with that. This may
not be practical, or it may end up being messier than a standalone
implementation; but it's not silly on its face to want to reuse some of
that code.

Yeah, I think we're all in agreement that reusing as much of the EPQ
machinery as is sensible is something we should do. We are not in
agreement on which parts of it need to be changed or extended.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#146

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Robert Haas (#145)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Oct 16, 2015 at 6:12 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

I think it is the right approach to pretend EPQ doesn't exist if scanrelid==0,
because what these ForeignScan/CustomScan nodes replace are local join
nodes like NestLoop. Those don't have their own EPQ slot, but construct the
joined tuple from the underlying scan tuples originally stored in the EPQ slots.

I think you've got that backwards. The fact that they don't have
their own EPQ slot is the problem we need to solve. When an EPQ
recheck happens, we rescan every relation in the query. Each relation
needs to return 0 or 1 tuples. If it returns a tuple, the tuple it
returns must be either the same tuple it previously returned, or an
updated version of that tuple. But "the tuple it previously returned"
does not necessarily mean the tuple it returned most recently. It
means the tuple that it returned which, when passed through the rest
of the plan, contributed to generate the result tuple that is being
rechecked.

Yes, that is the reason why ctid or a whole-row var (for early row locking) or
some kind of rowid (for late row locking) is required to fill the EPQ slots
of the base relations.
I understand that the most recently returned tuple is not the answer here
(e.g., when the ForeignScan is located under a MergeJoin).

Now, if you don't have an EPQ slot, how are you going to do this?
When the EPQ machinery engages, you need to somehow get the tuple you
previously returned stored someplace. And the first time thereafter
that you get called by ExecProcNode, you need to return that tuple,
provided that it still passes the quals. The second time you get
called, and any subsequent times, you need to return an empty slot.
The EPQ slot is well-suited to this task. It's got a TupleTableSlot
to store the tuple you need to return, and it's got a flag indicating
whether you've already returned that tuple. So you're good.
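
As an illustration of the contract described above (return the stored tuple
once if it still passes the quals, then return an empty slot), here is a
minimal, self-contained sketch in C. It is a toy model, not executor code:
ToyTuple, ToyEpqSlot, toy_epq_fetch, toy_epq_rescan and toy_qual_a_is_one are
hypothetical names, and the two flags only play the roles that es_epqTupleSet
and es_epqScanDone play in ExecScanFetch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a tuple; not a real PostgreSQL type. */
typedef struct ToyTuple
{
	int			a;
	int			b;
} ToyTuple;

/* Toy EPQ slot: one stored test tuple plus an "already returned" flag. */
typedef struct ToyEpqSlot
{
	bool		tuple_set;		/* has a test tuple been stored? */
	bool		scan_done;		/* was it already handed out in this recheck? */
	ToyTuple	tuple;
} ToyEpqSlot;

typedef bool (*ToyQual) (const ToyTuple *tuple);

/*
 * Return the stored tuple at most once per recheck, and only if it still
 * passes the qual.  NULL models an empty slot.
 */
const ToyTuple *
toy_epq_fetch(ToyEpqSlot *slot, ToyQual qual)
{
	if (!slot->tuple_set || slot->scan_done)
		return NULL;

	/* Remember that nothing more may be returned until the next rescan. */
	slot->scan_done = true;

	if (!qual(&slot->tuple))
		return NULL;

	return &slot->tuple;
}

/* A rescan re-arms the slot so the stored tuple can be returned again. */
void
toy_epq_rescan(ToyEpqSlot *slot)
{
	slot->scan_done = false;
}

/* Example qual: a = 1. */
bool
toy_qual_a_is_one(const ToyTuple *tuple)
{
	return tuple->a == 1;
}
```

Calling toy_epq_fetch twice in a row yields the tuple and then NULL; after
toy_epq_rescan the tuple is available once more. The open question in this
thread is where the equivalent state lives when scanrelid == 0.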

But with Etsuro Fujita's patch, and I think what you have proposed has
been similar, how are you going to do it? The proposal is to call the
recheck method and hope for the best, but what is the recheck method
going to do? Where is it going to get the previously-returned tuple?
How will it know if it has already returned it during the lifetime of
this EPQ check? Offhand, it looks to me like, at least in some
circumstances, you're probably going to return whatever tuple you
returned most recently (which has a good chance of being the right
one, but not necessarily) over and over again. That's not going to
fly.

I think the job of the recheck method, rather than "hope for the best", is as follows.

1. Fetch every EPQ slot of the base relations involved in this join.
In case of ForeignScan, all the required tuples of the base relations
should already be filled in, because they were fetched in advance by a
whole-row var for early row locking, or by RefetchForeignRow for late row locking.
In case of CustomScan, it can call ExecProcNode() to generate the
first tuple if it does not exist yet.
Anyway, I assume all the component tuples of this join can be fetched
using the existing EPQ slots because they are owned by base relations.

2. The recheck callback fills ss_ScanTupleSlot according to the
fdw_scan_tlist or custom_scan_tlist. The callback knows best how
to reconstruct the joined tuple from the base relations' tuples fetched
in step 1.
For example, if the joined tuple consists of (t1.a, t1.b, t2.x, t3.s),
the callback picks up 't1.a' and 't1.b' from the tuple fetched from
the EPQ slot of t1, then puts these values into the 1st and 2nd columns.
Likewise, it picks up 't2.x' from the tuple fetched from the EPQ slot of
t2, then puts this value into the 3rd column. The same goes for 't3'.
At this point, ss_ScanTupleSlot is filled with the expected fields
as if the join clauses were satisfied.

3. The recheck callback also checks the pushed-down qualifiers of the base
relations. Because the expression nodes kept in fdw_exprs or
custom_exprs are set up in setrefs.c to reference ss_ScanTupleSlot,
it is more reasonable to run ExecQual after step 2.
If any of the base-relation qualifiers evaluates to false,
the recheck callback returns an empty slot.

4. The recheck callback also checks the join clauses that join the underlying
base relations. For the same reason as in step 3, it is more reasonable
to run ExecQual after step 2.
If any of the join clauses evaluates to false, the recheck returns
an empty slot.
Otherwise, it returns ss_ScanTupleSlot, and ExecScan takes care of
the rest.
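
The four steps above can be sketched roughly as follows. This is a toy model
under stated assumptions, not the FDW API: ToyT1, ToyT2, ToyJoined and
toy_join_recheck are hypothetical names, the base tuples are passed in
directly instead of being read from EPQ slots, and the pushed-down quals
(t1.b > 0 and t1.a = t2.a) are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical base-relation tuples, as they might sit in per-rel EPQ slots. */
typedef struct ToyT1
{
	int			a;
	int			b;
} ToyT1;

typedef struct ToyT2
{
	int			a;
	int			x;
} ToyT2;

/* Hypothetical scan-tuple layout: (t1.a, t1.b, t2.a, t2.x). */
typedef struct ToyJoined
{
	int			t1_a;
	int			t1_b;
	int			t2_a;
	int			t2_x;
} ToyJoined;

/*
 * Rebuild the joined tuple from the base tuples and recheck the pushed-down
 * conditions against it.  Returns false to model "return an empty slot".
 */
bool
toy_join_recheck(const ToyT1 *t1, const ToyT2 *t2, ToyJoined *scan_slot)
{
	/* Step 1: every component tuple must be available. */
	if (t1 == NULL || t2 == NULL)
		return false;

	/* Step 2: fill the scan slot according to the scan tuple layout. */
	scan_slot->t1_a = t1->a;
	scan_slot->t1_b = t1->b;
	scan_slot->t2_a = t2->a;
	scan_slot->t2_x = t2->x;

	/* Step 3: pushed-down scan qual, evaluated against the scan slot. */
	if (!(scan_slot->t1_b > 0))
		return false;

	/* Step 4: pushed-down join clause, evaluated against the scan slot. */
	if (scan_slot->t1_a != scan_slot->t2_a)
		return false;

	return true;
}
```

Note that this sketch only covers an inner join; it says nothing about how
often the tuple may be returned within one recheck, which is the concern
raised in the quoted text above.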

Even though Fujita-san's patch implements steps 2 to 4 using an
alternative local plan, with no other option, it is based on a similar concept:
- The EPQ slots contain the base-relation tuples that contributed to the join.
- The FDW/CSP knows best how to construct the joined tuple.
- The joined tuple is constructed on the fly, not kept in a particular EPQ slot.

The bottom line is that a foreign scan that is a pushed-down join is
still a *scan*, and every already-existing scan type has an EPQ slot
*for a reason*. They *need* it in order to deliver the correct
behavior. And foreign scans and custom scans need it too, and for the
same reason.

Probably, this is the reason for our disagreement about the solution.
Even though ForeignScan/CustomScan is categorized as a scan node, from the
standpoint of the core backend it is expected to take responsibility for
the join in addition to the scan of the base relations.
Having multiple roles gives ForeignScan/CustomScan both the capability and
the responsibility to handle multiple EPQ slots for join rechecks.

Please consider that the reason why an existing scan node is associated with
a particular EPQ slot is that it has only one role: to scan a particular
base relation. But what is the natural way to handle it if a scan node
actually has multiple roles?

On Fri, Oct 16, 2015 at 7:48 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

My opinion is, simply, that a ForeignScan/CustomScan with scanrelid==0 takes
over responsibility for the EPQ recheck of the entire join sub-tree that is
replaced by the ForeignScan/CustomScan node.
If a ForeignScan runs a remote join on foreign tables A and B, it shall
apply both the scan quals and the join clause to the tuples kept in
the EPQ slots, in some fashion that depends on the FDW implementation.

And my opinion, as I said before, is that's completely wrong. The
ForeignScan which represents a pushed-down join is a *scan*. In
general, scans have one EPQ slot, and that is the right number. This
pushed-down join scan, though, is in a state of confusion. The code
that populates the EPQ slots thinks it's got multiple slots, one per
underlying relation. Meanwhile, the code that reads data back out of
those slots thinks it doesn't have any slots at all. Both of those
pieces of code are wrong. This foreign scan, like any other scan,
should use ONE slot.

Both you and Etsuro Fujita are proposing to fix this problem by
somehow making it the FDW's problem to reconstruct the tuple
previously produced by the join from whole-row images of the baserels.
But that's not looking back far enough: why are we asking for
whole-row images of the baserels when what we really want is a
whole-row image of the output of the join? The output of the join is
what we need to re-return.

Yes, the output of the join is exactly what we need to re-return.
On the other hand, the joined tuple image depends on the latest
images of the base relations' tuples that make up the joined tuple.

Once some of the base relations' tuples are re-fetched and updated,
that affects the contents of the joined tuple and its visibility.
It means, more or less, that we need the capability to reconstruct the
joined tuple from the base relations again, in addition to the rechecks.

Therefore, I concluded that reconstructing the joined tuple on the fly in
the FDW/CSP is reasonably implementable and a less invasive approach
than the others.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#147

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Robert Haas (#145)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/17 12:22, Robert Haas wrote:

On Fri, Oct 16, 2015 at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Both you and Etsuro Fujita are proposing to fix this problem by
somehow making it the FDW's problem to reconstruct the tuple
previously produced by the join from whole-row images of the baserels.
But that's not looking back far enough: why are we asking for
whole-row images of the baserels when what we really want is a
whole-row image of the output of the join? The output of the join is
what we need to re-return.

There are multiple components to the requirement though:

3. If so, form a join row and return that. Else return NULL.

Not check.

Suppose we've got two foreign tables ft1 and ft2, using postgres_fdw.
There is a local table t. The user does something like UPDATE t SET
... FROM ft1, ft2 WHERE t.a = ft1.a AND ft1.b = ft2.b AND .... The
query planner generates something like:

Update
  -> Join
       -> Scan on t
       -> Foreign Scan on <ft1, ft2>

If an EPQ recheck occurs, the only thing that matters is that the
Foreign Scan return the right output row (or possibly no rows, if the
row it would have formed no longer matches the quals). It doesn't
matter how it does this. Let's say the columns actually needed by the
query from the ft1-ft2 join are ft1.a, ft1.b, ft2.a, and ft2.b.
Currently, the output of the foreign scan is something like: ft1.a,
ft1.b, ft2.a, ft2.b, ft1.*, ft2.*. The EPQ recheck has access to ft1.*
and ft2.*, but it's not straightforward for postgres_fdw to regenerate
the join tuple from that. Maybe the pushed-down join was a left join,
maybe it was a right join, maybe it was a full join. So some of the
columns could have gone to NULL. To figure it out, you need to build
a secondary plan tree that mimics the structure of the join you pushed
down, which is kinda hairy.

As Tom mentioned, just recomputing the original join tuple is not good
enough. We would need to rejoin the test tuples for the baserels even
if ROW_MARK_COPY is in use. Consider:

A=# BEGIN;
A=# UPDATE t SET a = a + 1 WHERE b = 1;
B=# SELECT * from t, ft1, ft2
WHERE t.a = ft1.a AND t.b = ft2.b AND ft1.c = ft2.c FOR UPDATE;
A=# COMMIT;

where the plan for the SELECT FOR UPDATE is

LockRows
  -> Nested Loop
       -> Seq Scan on t
       -> Foreign Scan on <ft1, ft2>
            Remote SQL: SELECT * FROM ft1 JOIN ft2 ON ft1.c = ft2.c
                        WHERE ft1.a = $1 AND ft2.b = $2

If an EPQ recheck is invoked by A's UPDATE, just recomputing the
original join tuple from the whole-row image that you proposed would
output an incorrect result in the EPQ recheck, since the value a in the
updated version of a to-be-joined tuple in t would no longer match the
value ft1.a extracted from the whole-row image if A's UPDATE has
committed successfully. So I think we would need to rejoin the tuples
populated from the whole-row images for the baserels ft1 and ft2, by
executing the secondary plan with the new parameter values for a and b.
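
To make the parameter problem concrete, here is a minimal toy sketch in C.
It is not postgres_fdw code: ToyForeignJoinRow and toy_rejoin_with_params are
hypothetical names, and the row simply models a previously fetched ft1-ft2
join result that was obtained with $1 and $2 bound to the old values of t.a
and t.b. The point it illustrates is that the saved join row has to be
checked again against the parameter values taken from the updated version of
the outer tuple.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical image of a previously returned ft1-ft2 join row. */
typedef struct ToyForeignJoinRow
{
	int			ft1_a;			/* was matched against $1 (t.a) */
	int			ft2_b;			/* was matched against $2 (t.b) */
	int			c;				/* ft1.c = ft2.c, already satisfied remotely */
} ToyForeignJoinRow;

/*
 * Recheck the saved join row against the *current* parameter values,
 * i.e. the values taken from the latest version of the outer tuple in t.
 * Returns false when the row must no longer be returned.
 */
bool
toy_rejoin_with_params(const ToyForeignJoinRow *row, int param_a, int param_b)
{
	return row->ft1_a == param_a && row->ft2_b == param_b;
}
```

With the session above, the row fetched for t = (a = 1, b = 1) passes for
parameters (1, 1), but after A's UPDATE commits the outer tuple is
(a = 2, b = 1) and the same saved row fails the check for (2, 1).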

As for the secondary plan, I think we could create the corresponding
local join execution path during GetForeignJoinPaths, (1) by looking at
the pathlist of the joinrel RelOptInfo, which would have already
contained some local join execution paths, as does the patch, or (2) by
calling a helper function that creates a local join execution path from
given outer/inner paths selected from the pathlists of the
outerrel/innerrel RelOptInfos, as proposed by KaiGai-san before. ISTM
that the latter would be better, so I plan to propose such a function as
part of the postgres_fdw join pushdown patch for 9.6.

This example is of the early row locking case, but I think the story
is about the same if the FDW wants to do late row locking instead. If
there's an EPQ recheck, it could issue individual row re-fetches
against every base table and then re-do all the joins that it pushed
down locally. But it would be faster and cleaner, I think, to send
one query to the remote side that re-fetches all the rows at once, and
whose target list is exactly what we need, rather than whole row
targetlists for each baserel that then have to be rejiggered on our
side.

I agree with you on that point. (In fact, I thought that too!) But
considering that many FDWs including postgres_fdw use early row locking
(ie, ROW_MARK_COPY) currently, I'd like to leave that for future work.

I think what Kaigai-san and Etsuro-san are after is trying to find a way
to reuse some of the existing EPQ machinery to help with that. This may
not be practical, or it may end up being messier than a standalone
implementation; but it's not silly on its face to want to reuse some of
that code.

Yeah, I think we're all in agreement that reusing as much of the EPQ
machinery as is sensible is something we should do. We are not in
agreement on which parts of it need to be changed or extended.

Agreed.

Best regards,
Etsuro Fujita


#148

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Etsuro Fujita (#147)

Re: Foreign join pushdown vs EvalPlanQual

On Fri, Oct 16, 2015 at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Both you and Etsuro Fujita are proposing to fix this problem by
somehow making it the FDW's problem to reconstruct the tuple
previously produced by the join from whole-row images of the baserels.
But that's not looking back far enough: why are we asking for
whole-row images of the baserels when what we really want is a
whole-row image of the output of the join? The output of the join is
what we need to re-return.

There are multiple components to the requirement though:

1. Recheck the rows that were in the baserels and possibly fetch updated
versions of them. (Once we're in EPQ, we want the most recent row
versions, not what the query snapshot can see.)

Check. But postgres_fdw, and probably quite a few other FDWs, use
early row locking. So ROW_MARK_COPY is in use and we need not worry
about refetching rows.

2. Apply relevant restriction clauses and see if the updated row versions
still pass the clauses.

Check.

3. If so, form a join row and return that. Else return NULL.

Not check.

Suppose we've got two foreign tables ft1 and ft2, using postgres_fdw.
There is a local table t. The user does something like UPDATE t SET
... FROM ft1, ft2 WHERE t.a = ft1.a AND ft1.b = ft2.b AND .... The
query planner generates something like:

Update
  -> Join
       -> Scan on t
       -> Foreign Scan on <ft1, ft2>

If an EPQ recheck occurs, the only thing that matters is that the
Foreign Scan return the right output row (or possibly no rows, if the
row it would have formed no longer matches the quals). It doesn't
matter how it does this. Let's say the columns actually needed by the
query from the ft1-ft2 join are ft1.a, ft1.b, ft2.a, and ft2.b.
Currently, the output of the foreign scan is something like: ft1.a,
ft1.b, ft2.a, ft2.b, ft1.*, ft2.*. The EPQ recheck has access to ft1.*
and ft2.*, but it's not straightforward for postgres_fdw to regenerate
the join tuple from that. Maybe the pushed-down join was a left join,
maybe it was a right join, maybe it was a full join. So some of the
columns could have gone to NULL. To figure it out, you need to build
a secondary plan tree that mimics the structure of the join you pushed
down, which is kinda hairy.

In the case of an outer join, do we need to care about the join clause, as
opposed to the scan qualifiers?

NULL-filled rows appear when there is no matching tuple on the other
side. That means any row in the relation on the non-nullable side is visible
regardless of the join clause, whether or not it matches
the latest rows refetched based on the latest values.

Example)
remote table: ft1
id | val
---+-------
1 | 'aaa'
2 | 'bbb'
3 | 'ccc'

remote table: ft2
id | val
---+-------
2 | 'xxx'
3 | 'yyy'
4 | 'zzz'

If remote join query is:
SELECT *, ft1.*, ft2.* FROM ft1 LEFT JOIN ft2 ON ft1.id = ft2.id WHERE ft1.id < 3;
its expected result is:
ft1.id | ft1.val | ft2.id | ft2.val | ft1.* | ft2.* |
-------+---------+--------+---------+---------+---------+
1 | 'aaa' | null | null |(1,'aaa')| null |
2 | 'bbb' | 2 | 'xxx' |(2,'bbb')|(2,'xxx')|

Rows on the non-nullable side (ft1 in this case) are visible regardless of the
join clause, as long as the tuples in ft1 satisfy the scan qualifier (ft1.id < 3).

The FDW/CSP knows the type of join it is responsible for, so it can skip
evaluating the join clauses and apply only the scan qualifiers to the base
relations' tuples.
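
One possible reading of the LEFT JOIN example above, as a toy sketch in C.
All names here (ToyRow, ToyLeftJoinRow, toy_left_join_recheck) are
hypothetical and nothing in it is FDW API. The assumption is that the recheck
applies the scan qualifier (ft1.id < 3) to the non-nullable side, and uses
the join clause (ft1.id = ft2.id) only to decide whether the ft2 columns are
NULL-extended, never to reject the row.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical base-relation row: (id, val). */
typedef struct ToyRow
{
	int			id;
	const char *val;
} ToyRow;

/* Hypothetical joined row for ft1 LEFT JOIN ft2. */
typedef struct ToyLeftJoinRow
{
	ToyRow		ft1;
	bool		ft2_is_null;	/* true when ft2 columns are NULL-extended */
	ToyRow		ft2;
} ToyLeftJoinRow;

/*
 * Recheck for "ft1 LEFT JOIN ft2 ON ft1.id = ft2.id WHERE ft1.id < 3".
 * ft2 may be NULL when no ft2 tuple is available.
 * Returns false to model "return an empty slot".
 */
bool
toy_left_join_recheck(const ToyRow *ft1, const ToyRow *ft2,
					  ToyLeftJoinRow *out)
{
	/* The scan qualifier on the non-nullable side can reject the row. */
	if (ft1 == NULL || !(ft1->id < 3))
		return false;

	out->ft1 = *ft1;

	/* The join clause only decides between a match and NULL-extension. */
	if (ft2 != NULL && ft1->id == ft2->id)
	{
		out->ft2_is_null = false;
		out->ft2 = *ft2;
	}
	else
	{
		out->ft2_is_null = true;
		out->ft2.id = 0;
		out->ft2.val = NULL;
	}

	return true;
}
```

For the sample data, ft1 row 1 comes back with NULL-extended ft2 columns,
ft1 row 2 comes back joined with ft2 row 2, and ft1 row 3 is rejected by the
scan qualifier.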

Think how much easier your life would be if you hadn't bothered
fetching ft1.* and ft2.*, which aren't so handy in this case, and had
instead made the output of the foreign scan ft1.a, ft1.b, ft2.a,
ft2.b, ROW(ft1.a, ft1.b, ft2.a, ft2.b) -- and that the output of that
ROW() operation was stored in an EPQ slot. Now, you don't need the
secondary plan tree any more. You've got all the data you need right
in your hand. The values inside the ROW() constructor were evaluated
after accounting for the goes-to-NULL effects of any pushed-down
joins.

This example is of the early row locking case, but I think the story
is about the same if the FDW wants to do late row locking instead. If
there's an EPQ recheck, it could issue individual row re-fetches
against every base table and then re-do all the joins that it pushed
down locally. But it would be faster and cleaner, I think, to send
one query to the remote side that re-fetches all the rows at once, and
whose target list is exactly what we need, rather than whole row
targetlists for each baserel that then have to be rejiggered on our
side.

Which approach is more reasonable?

In the case of early row locking, the FDW ensures that all the rows involved
in the join are protected from concurrent access. So there is no need to worry
about refetching from the remote side.
On the other hand, in the case of late row locking, we need to pay attention
to whether some (or all) of the base relations have been updated by concurrent
access. In that case the joined tuple is no longer valid, so we may need
to fetch the joined tuple from the remote side during the recheck.
Probably, the relevant rowid (the ctid system column in postgres_fdw) makes it
possible to identify the tuples to be fetched from the remote side efficiently,
so it should not be a heavy query; however, it does need to run one remote query.

If we reconstruct the joined tuple from the base-relation tuples kept in the
EPQ slots, there is an additional reconstruction cost in the early row locking
case (disadvantage); however, there is no need to run the remote join again in
the late row locking case, because the base relations' tuples have already
been fetched by the infrastructure (advantage). The local reconstruction
approach also has the advantage that it does not need to extend the existing
EPQ slot mechanism very much.
All this approach needs is for the EPQ slots to hold the tuples of the base relations.

Please correct me if I misunderstand your proposal.

I think what Kaigai-san and Etsuro-san are after is trying to find a way
to reuse some of the existing EPQ machinery to help with that. This may
not be practical, or it may end up being messier than a standalone
implementation; but it's not silly on its face to want to reuse some of
that code.

Yeah, I think we're all in agreement that reusing as much of the EPQ
machinery as is sensible is something we should do. We are not in
agreement on which parts of it need to be changed or extended.

Yes. I'd also like to reuse the existing EPQ infrastructure as far as we can.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#149

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Robert Haas (#142)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/17 9:58, Robert Haas wrote:

But with Etsuro Fujita's patch, and I think what you have proposed has
been similar, how are you going to do it? The proposal is to call the
recheck method and hope for the best, but what is the recheck method
going to do? Where is it going to get the previously-returned tuple?

As I explained in a previous email, just returning the
previously-returned tuple is not good enough.

How will it know if it has already returned it during the lifetime of
this EPQ check? Offhand, it looks to me like, at least in some
circumstances, you're probably going to return whatever tuple you
returned most recently (which has a good chance of being the right
one, but not necessarily) over and over again. That's not going to
fly.

No. Since the local join execution plan is created so that the scan
slot for each foreign table involved in the pushed-down join looks at
its EPQ slot, I think the plan can return at most one tuple.

Best regards,
Etsuro Fujita


#150

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Kouhei Kaigai (#137)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/16 19:03, Kouhei Kaigai wrote:

*** 48,59 **** ExecScanFetch(ScanState *node,
+ 		/*
+ 		 * Execute recheck plan and get the next tuple if foreign join.
+ 		 */
+ 		if (scanrelid == 0)
+ 		{
+ 			(*recheckMtd) (node, slot);
+ 			return slot;
+ 		}

Ensure the slot is empty if recheckMtd returned false, as the base relation
case does.

Fixed.

*** 347,352 **** ExecScanReScan(ScanState *node)
{
Index scanrelid = ((Scan *) node->ps.plan)->scanrelid;
+ 		if (scanrelid == 0)
+ 			return;				/* nothing to do */
+
Assert(scanrelid > 0);
estate->es_epqScanDone[scanrelid - 1] = false;

Why nothing to do?
Base relations managed by ForeignScan are tracked in fs_relids bitmap.

I think the estate->es_epqScanDone flag should be initialized when we do
ExecScanReScan for each of the component ForeignScanState nodes in the
local join execution plan state tree.

As you pointed out a few days ago, if ForeignScan has a parameterized
remote join, the EPQ slot contains invalid tuples based on the old outer tuple.

Maybe my explanation was not enough, but I haven't said such a thing.
The problem in that case is that just returning the previously-returned
foreign-join tuple would produce an incorrect result if an outer tuple to
be joined has changed due to a concurrent transaction, as explained
upthread. (I think that the EPQ slots would contain valid tuples.)

Attached is an updated version of the patch.

Other changes:
* remove unnecessary memory-context handling for the foreign-join case
in ForeignRecheck
* revise code a bit and add a bit more comments

Thanks for the comments!

Best regards,
Etsuro Fujita

Attachments:

foreign-recheck-for-foreign-join-v2.patch (text/x-patch)

*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***************
*** 525,530 **** fileGetForeignPaths(PlannerInfo *root,
--- 525,531 ----
  									 total_cost,
  									 NIL,		/* no pathkeys */
  									 NULL,		/* no outer rel either */
+ 									 NULL,		/* no alternative path */
  									 coptions));
  
  	/*
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***************
*** 560,565 **** postgresGetForeignPaths(PlannerInfo *root,
--- 560,566 ----
  								   fpinfo->total_cost,
  								   NIL, /* no pathkeys */
  								   NULL,		/* no outer rel either */
+ 								   NULL,		/* no alternative path */
  								   NIL);		/* no fdw_private list */
  	add_path(baserel, (Path *) path);
  
***************
*** 727,732 **** postgresGetForeignPaths(PlannerInfo *root,
--- 728,734 ----
  									   total_cost,
  									   NIL,		/* no pathkeys */
  									   param_info->ppi_req_outer,
+ 									   NULL,	/* no alternative path */
  									   NIL);	/* no fdw_private list */
  		add_path(baserel, (Path *) path);
  	}
*** a/src/backend/executor/execScan.c
--- b/src/backend/executor/execScan.c
***************
*** 48,59 **** ExecScanFetch(ScanState *node,
  		 * conditions.
  		 */
  		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;
  
  		Assert(scanrelid > 0);
  		if (estate->es_epqTupleSet[scanrelid - 1])
  		{
- 			TupleTableSlot *slot = node->ss_ScanTupleSlot;
- 
  			/* Return empty slot if we already returned a tuple */
  			if (estate->es_epqScanDone[scanrelid - 1])
  				return ExecClearTuple(slot);
--- 48,67 ----
  		 * conditions.
  		 */
  		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;
+ 		TupleTableSlot *slot = node->ss_ScanTupleSlot;
+ 
+ 		if (scanrelid == 0)
+ 		{
+ 			/* Execute recheck plan and store result in the slot */
+ 			if (!(*recheckMtd) (node, slot))
+ 				ExecClearTuple(slot);	/* would not be returned by scan */
+ 
+ 			return slot;
+ 		}
  
  		Assert(scanrelid > 0);
  		if (estate->es_epqTupleSet[scanrelid - 1])
  		{
  			/* Return empty slot if we already returned a tuple */
  			if (estate->es_epqScanDone[scanrelid - 1])
  				return ExecClearTuple(slot);
***************
*** 347,352 **** ExecScanReScan(ScanState *node)
--- 355,363 ----
  	{
  		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;
  
+ 		if (scanrelid == 0)
+ 			return;				/* nothing to do */
+ 
  		Assert(scanrelid > 0);
  
  		estate->es_epqScanDone[scanrelid - 1] = false;
*** a/src/backend/executor/nodeForeignscan.c
--- b/src/backend/executor/nodeForeignscan.c
***************
*** 24,29 ****
--- 24,30 ----
  
  #include "executor/executor.h"
  #include "executor/nodeForeignscan.h"
+ #include "executor/tuptable.h"
  #include "foreign/fdwapi.h"
  #include "utils/memutils.h"
  #include "utils/rel.h"
***************
*** 73,80 **** ForeignNext(ForeignScanState *node)
--- 74,99 ----
  static bool
  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  {
+ 	Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
  	ExprContext *econtext;
  
+ 	if (scanrelid == 0)
+ 	{
+ 		TupleTableSlot *result;
+ 
+ 		Assert(node->fdw_recheck_plan != NULL);
+ 
+ 		/* Execute recheck plan */
+ 		result = ExecProcNode(node->fdw_recheck_plan);
+ 		if (TupIsNull(result))
+ 			return false;
+ 
+ 		/* Store result in the given slot */
+ 		ExecCopySlot(slot, result);
+ 
+ 		return true;
+ 	}
+ 
  	/*
  	 * extract necessary information from foreign scan node
  	 */
***************
*** 200,205 **** ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
--- 219,230 ----
  	ExecAssignScanProjectionInfoWithVarno(&scanstate->ss, tlistvarno);
  
  	/*
+ 	 * Initialize recheck plan.
+ 	 */
+ 	scanstate->fdw_recheck_plan = ExecInitNode(node->fdw_recheck_plan,
+ 											   estate, eflags);
+ 
+ 	/*
  	 * Initialize FDW-related state.
  	 */
  	scanstate->fdwroutine = fdwroutine;
***************
*** 235,240 **** ExecEndForeignScan(ForeignScanState *node)
--- 260,268 ----
  	/* close the relation. */
  	if (node->ss.ss_currentRelation)
  		ExecCloseScanRelation(node->ss.ss_currentRelation);
+ 
+ 	/* shut down recheck plan. */
+ 	ExecEndNode(node->fdw_recheck_plan);
  }
  
  /* ----------------------------------------------------------------
***************
*** 246,252 **** ExecEndForeignScan(ForeignScanState *node)
--- 274,301 ----
  void
  ExecReScanForeignScan(ForeignScanState *node)
  {
+ 	Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+ 
  	node->fdwroutine->ReScanForeignScan(node);
  
  	ExecScanReScan(&node->ss);
+ 
+ 	if (scanrelid == 0)
+ 	{
+ 		Assert(node->fdw_recheck_plan != NULL);
+ 
+ 		/*
+ 		 * set chgParam for recheck plan
+ 		 */
+ 		if (((PlanState *) node)->chgParam != NULL)
+ 			UpdateChangedParamSet(node->fdw_recheck_plan,
+ 								  ((PlanState *) node)->chgParam);
+ 
+ 		/*
+ 		 * if chgParam of recheck plan is not null then the plan will be
+ 		 * re-scanned by first ExecProcNode.
+ 		 */
+ 		if (node->fdw_recheck_plan->chgParam == NULL)
+ 			ExecReScan(node->fdw_recheck_plan);
+ 	}
  }
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 648,653 **** _copyForeignScan(const ForeignScan *from)
--- 648,654 ----
  	COPY_NODE_FIELD(fdw_exprs);
  	COPY_NODE_FIELD(fdw_private);
  	COPY_NODE_FIELD(fdw_scan_tlist);
+ 	COPY_NODE_FIELD(fdw_recheck_plan);
  	COPY_NODE_FIELD(fdw_recheck_quals);
  	COPY_BITMAPSET_FIELD(fs_relids);
  	COPY_SCALAR_FIELD(fsSystemCol);
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 594,599 **** _outForeignScan(StringInfo str, const ForeignScan *node)
--- 594,600 ----
  	WRITE_NODE_FIELD(fdw_exprs);
  	WRITE_NODE_FIELD(fdw_private);
  	WRITE_NODE_FIELD(fdw_scan_tlist);
+ 	WRITE_NODE_FIELD(fdw_recheck_plan);
  	WRITE_NODE_FIELD(fdw_recheck_quals);
  	WRITE_BITMAPSET_FIELD(fs_relids);
  	WRITE_BOOL_FIELD(fsSystemCol);
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
***************
*** 1798,1803 **** _readForeignScan(void)
--- 1798,1804 ----
  	READ_NODE_FIELD(fdw_exprs);
  	READ_NODE_FIELD(fdw_private);
  	READ_NODE_FIELD(fdw_scan_tlist);
+ 	READ_NODE_FIELD(fdw_recheck_plan);
  	READ_NODE_FIELD(fdw_recheck_quals);
  	READ_BITMAPSET_FIELD(fs_relids);
  	READ_BOOL_FIELD(fsSystemCol);
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 2141,2146 **** create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
--- 2141,2157 ----
  	scan_plan->fs_relids = best_path->path.parent->relids;
  
  	/*
+ 	 * If we're scanning a join relation, generate a recheck plan for
+ 	 * EvalPlanQual support.  (Irrelevant if scanning a base relation.)
+ 	 */
+ 	if (scan_relid == 0)
+ 	{
+ 		scan_plan->fdw_recheck_plan =
+ 			create_plan_recurse(root, best_path->fdw_recheck_path);
+ 		scan_plan->fdw_recheck_plan->targetlist = scan_plan->fdw_scan_tlist;
+ 	}
+ 
+ 	/*
  	 * Replace any outer-relation variables with nestloop params in the qual
  	 * and fdw_exprs expressions.  We do this last so that the FDW doesn't
  	 * have to be involved.  (Note that parts of fdw_exprs could have come
***************
*** 3758,3763 **** make_foreignscan(List *qptlist,
--- 3769,3776 ----
  	node->fdw_exprs = fdw_exprs;
  	node->fdw_private = fdw_private;
  	node->fdw_scan_tlist = fdw_scan_tlist;
+ 	/* fdw_recheck_plan will be filled in by create_foreignscan_plan */
+ 	node->fdw_recheck_plan = NULL;
  	node->fdw_recheck_quals = fdw_recheck_quals;
  	/* fs_relids will be filled in by create_foreignscan_plan */
  	node->fs_relids = NULL;
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
***************
*** 1130,1135 **** set_foreignscan_references(PlannerInfo *root,
--- 1130,1137 ----
  		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
  		fscan->fdw_scan_tlist =
  			fix_scan_list(root, fscan->fdw_scan_tlist, rtoffset);
+ 		/* fdw_recheck_plan needs set_plan_refs() adjustments */
+ 		set_plan_refs(root, fscan->fdw_recheck_plan, rtoffset);
  	}
  	else
  	{
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
***************
*** 2405,2410 **** finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
--- 2405,2419 ----
  				/* We assume fdw_scan_tlist cannot contain Params */
  				context.paramids = bms_add_members(context.paramids,
  												   scan_params);
+ 
+ 				/* recheck plan if foreign join */
+ 				if (fscan->scan.scanrelid == 0)
+ 					context.paramids =
+ 						bms_add_members(context.paramids,
+ 										finalize_plan(root,
+ 													  fscan->fdw_recheck_plan,
+ 													  valid_params,
+ 													  scan_params));
  			}
  			break;
  
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***************
*** 1488,1493 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1488,1494 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *fdw_recheck_path,
  						List *fdw_private)
  {
  	ForeignPath *pathnode = makeNode(ForeignPath);
***************
*** 1501,1506 **** create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 1502,1508 ----
  	pathnode->path.total_cost = total_cost;
  	pathnode->path.pathkeys = pathkeys;
  
+ 	pathnode->fdw_recheck_path = fdw_recheck_path;
  	pathnode->fdw_private = fdw_private;
  
  	return pathnode;
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1579,1584 **** typedef struct WorkTableScanState
--- 1579,1585 ----
  typedef struct ForeignScanState
  {
  	ScanState	ss;				/* its first field is NodeTag */
+ 	PlanState  *fdw_recheck_plan;	/* local join execution plan */
  	List	   *fdw_recheck_quals;	/* original quals not in ss.ps.qual */
  	/* use struct pointer to avoid including fdwapi.h here */
  	struct FdwRoutine *fdwroutine;
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 529,534 **** typedef struct ForeignScan
--- 529,535 ----
  	List	   *fdw_exprs;		/* expressions that FDW may evaluate */
  	List	   *fdw_private;	/* private data for FDW */
  	List	   *fdw_scan_tlist; /* optional tlist describing scan tuple */
+ 	Plan	   *fdw_recheck_plan;	/* local join execution plan */
  	List	   *fdw_recheck_quals;	/* original quals not in scan.plan.quals */
  	Bitmapset  *fs_relids;		/* RTIs generated by this scan */
  	bool		fsSystemCol;	/* true if any "system column" is needed */
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
***************
*** 903,912 **** typedef struct TidPath
--- 903,917 ----
   * generally a good idea to use a representation that can be dumped by
   * nodeToString(), so that you can examine the structure during debugging
   * with tools like pprint().
+  *
+  * If a ForeignPath node represents a remote join, then fdw_recheck_path is
+  * a local join execution path for use in EvalPlanQual.  (Else it is NULL.)
+  * The parameterization of fdw_recheck_path must be the same as that of path.
   */
  typedef struct ForeignPath
  {
  	Path		path;
+ 	Path	   *fdw_recheck_path;
  	List	   *fdw_private;
  } ForeignPath;
  
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
***************
*** 86,91 **** extern ForeignPath *create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
--- 86,92 ----
  						double rows, Cost startup_cost, Cost total_cost,
  						List *pathkeys,
  						Relids required_outer,
+ 						Path *fdw_recheck_path,
  						List *fdw_private);
  
  extern Relids calc_nestloop_required_outer(Path *outer_path, Path *inner_path);

#151

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Kouhei Kaigai (#135)

Re: Foreign join pushdown vs EvalPlanQual

I wrote:

As Robert mentioned in [1], I think that if we're inside EPQ,
pushed-down quals and/or pushed-down joins should be locally rechecked
in the same way as other cases such as IndexRecheck. So, I'll propose
the updated version of the patch.

On 2015/10/16 18:48, Kouhei Kaigai wrote:

You have not answered my question for two months.

I have never objected to executing the pushed-down qualifiers locally.
It is likely the best tactic in most cases.
But why do you try to force one particular approach on everyone?

There are various kinds of FDW drivers. How do you guarantee it is
the best solution for all of them? It is basically impossible.
(Please google "Probatio diabolica")

You are trying to add two special-purpose fields to ForeignScan:
fdw_recheck_plan and fdw_recheck_quals.
This requires FDW drivers to keep pushed-down qualifiers in a particular
data format, and also requires them to process EPQ rechecks with an
alternative local plan, even if some FDW drivers can handle
these jobs better with their own implementation.

I've repeatedly pointed out this issue, but have never gotten a reasonable
answer from you.

Again, I also admit that an alternative plan may be a reasonable tactic for
most FDW drivers. However, only the FDW author can "decide" that it is
the best tactic for handling the task in their module, not us.

I don't think it is good interface design to force people to
follow a particular implementation approach. It should be at the discretion
of the extension.

If you think so, you should give at least one concrete example of that,
ideally accompanied by a demo of how it works well.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#152

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Etsuro Fujita (#151)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: Etsuro Fujita [mailto:fujita.etsuro@lab.ntt.co.jp]
Sent: Monday, October 19, 2015 8:52 PM
To: Kaigai Kouhei(海外浩平); Kyotaro HORIGUCHI
Cc: pgsql-hackers@postgresql.org; shigeru.hanada@gmail.com;
robertmhaas@gmail.com
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

I wrote:

As Robert mentioned in [1], I think that if we're inside EPQ,
pushed-down quals and/or pushed-down joins should be locally rechecked
in the same way as other cases such as IndexRecheck. So, I'll propose
the updated version of the patch.

On 2015/10/16 18:48, Kouhei Kaigai wrote:

You have not answered my question for two months.

I have never objected to executing the pushed-down qualifiers locally.
It is likely the best tactic in most cases.
But why do you try to force one particular approach on everyone?

There are various kinds of FDW drivers. How do you guarantee it is
the best solution for all of them? It is basically impossible.
(Please google "Probatio diabolica")

You are trying to add two special-purpose fields to ForeignScan:
fdw_recheck_plan and fdw_recheck_quals.
This requires FDW drivers to keep pushed-down qualifiers in a particular
data format, and also requires them to process EPQ rechecks with an
alternative local plan, even if some FDW drivers can handle
these jobs better with their own implementation.

I've repeatedly pointed out this issue, but have never gotten a reasonable
answer from you.

Again, I also admit that an alternative plan may be a reasonable tactic for
most FDW drivers. However, only the FDW author can "decide" that it is
the best tactic for handling the task in their module, not us.

I don't think it is good interface design to force people to
follow a particular implementation approach. It should be at the discretion
of the extension.

If you think so, you should give at least one concrete example of that,
ideally accompanied by a demo of how it works well.

I previously showed an example situation:
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F801138B6F@BPXM15GP.gisp.nec.co.jp

Then, your response was below:
| Thanks for the answer, but I'm still not convinced.
| I think the EPQ testing shown in that use-case would probably not
| be efficient, compared to the core's.

What I keep talking about is the flexibility of the interface,
not efficiency. If the core backend provides a good enough EPQ recheck
routine, an extension can call it, but that is its author's decision.
Why do you want to prohibit an extension from choosing its implementation?

Also, I introduced the case of PG-Strom in the face-to-face meeting
with you. PG-Strom has its own CPU-fallback routine to recover from GPU
errors. Thus, I prefer to reuse this routine for EPQ rechecks, rather than
adding alternative local plan support here.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#153

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Etsuro Fujita (#147)

Re: Foreign join pushdown vs EvalPlanQual

On Mon, Oct 19, 2015 at 3:45 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

As Tom mentioned, just recomputing the original join tuple is not good
enough. We would need to rejoin the test tuples for the baserels even if
ROW_MARK_COPY is in use. Consider:

A=# BEGIN;
A=# UPDATE t SET a = a + 1 WHERE b = 1;
B=# SELECT * from t, ft1, ft2
WHERE t.a = ft1.a AND t.b = ft2.b AND ft1.c = ft2.c FOR UPDATE;
A=# COMMIT;

where the plan for the SELECT FOR UPDATE is

LockRows
  -> Nested Loop
       -> Seq Scan on t
       -> Foreign Scan on <ft1, ft2>
            Remote SQL: SELECT * FROM ft1 JOIN ft2 ON ft1.c = ft2.c
                        WHERE ft1.a = $1 AND ft2.b = $2

If an EPQ recheck is invoked by A's UPDATE, just recomputing the
original join tuple from the whole-row image, as you proposed, would output
an incorrect result in the EPQ recheck: once A's UPDATE has committed
successfully, the value of a in the updated version of the to-be-joined
tuple in t no longer matches the value ft1.a extracted from the whole-row
image. So I think we would need to rejoin the tuples populated from
the whole-row images for the baserels ft1 and ft2, by executing the
secondary plan with the new parameter values for a and b.

No. You just need to populate fdw_recheck_quals correctly, same as
for the scan case.
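
As a rough illustration of that idea, the sketch below re-evaluates the
parameterized quals from the Remote SQL above (ft1.a = $1 AND ft2.b = $2)
against a saved join tuple, using parameter values taken from the updated
row of t. It is a toy model in plain C, not executor code: JoinTuple,
OuterParams and recheck_join_tuple are hypothetical names, and real code
would work on TupleTableSlots and expression state instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the join tuple the ForeignScan returned earlier
 * (hypothetical; not a PostgreSQL structure). */
typedef struct JoinTuple
{
    int ft1_a;   /* ft1.a as fetched from the remote side */
    int ft2_b;   /* ft2.b as fetched from the remote side */
} JoinTuple;

/* Current values of the nestloop parameters $1 and $2 (t.a and t.b). */
typedef struct OuterParams
{
    int t_a;
    int t_b;
} OuterParams;

/*
 * Re-evaluate "ft1.a = $1 AND ft2.b = $2" against the saved join tuple,
 * with the parameters taken from the latest version of the t row.
 */
bool
recheck_join_tuple(const JoinTuple *jt, const OuterParams *params)
{
    assert(jt && params);
    return jt->ft1_a == params->t_a && jt->ft2_b == params->t_b;
}
```

With a saved tuple of (ft1.a = 1, ft2.b = 1), the recheck passes for the
original t row (a = 1, b = 1) and fails once the concurrent UPDATE has
changed that row to (a = 2, b = 1).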

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#154

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Kouhei Kaigai (#146)

Re: Foreign join pushdown vs EvalPlanQual

On Mon, Oct 19, 2015 at 12:17 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

1. Fetch the EPQ slot of every base relation involved in this join.
In the case of ForeignScan, all the required tuples of the base relations
should already be filled in, because they are fetched beforehand by a
whole-row Var under early row locking, or by RefetchForeignRow under late
row locking.
In the case of CustomScan, it can call ExecProcNode() to generate the
first tuple even if it does not exist yet.
Anyway, I assume all the component tuples of this join can be fetched
using the existing EPQ slots, because they are owned by base relations.

2. The recheck callback fills up ss_ScanTupleSlot according to the
fdw_scan_tlist or custom_scan_tlist. The callback knows the best way
to reconstruct the joined tuple from the base relations' tuples fetched
in step 1.
For example, if the joined tuple consists of (t1.a, t1.b, t2.x, t3.s),
the callback picks up 't1.a' and 't1.b' from the tuple fetched from
the EPQ slot of t1, then puts these values into the 1st and 2nd columns.
It also picks up 't2.x' from the tuple fetched from the EPQ slot of
t2, then puts this value into the 3rd column. The same goes for 't3'.
At this point, ss_ScanTupleSlot is filled with the expected fields,
as if the join clauses were satisfied.

3. The recheck callback also checks the pushed-down qualifiers of the base
relations. Because expression nodes kept in fdw_exprs or
custom_exprs are initialized to reference ss_ScanTupleSlot at setrefs.c,
it is more reasonable to run ExecQual after step 2.
If one of the base-relation qualifiers evaluates to false,
the recheck callback returns an empty slot.

4. The recheck callback also checks the join clauses that join the
underlying base relations. For the same reason as in step 3, it is more
reasonable to execute ExecQual after step 2.
If one of the join clauses evaluates to false, the recheck returns
an empty slot.
Otherwise, it returns ss_ScanTupleSlot, and ExecScan then handles
any further processing.
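
The four steps above can be sketched with a toy model in C. This is only
an illustration under simplifying assumptions: tuples are plain int arrays,
the scan tlist is an array of (relation, column) pairs, and quals are
function pointers. TlistEntry, QualFn, recheck_join and the two example
quals are hypothetical names, not PostgreSQL APIs; step 1 is modeled by the
caller passing in the already-fetched EPQ tuples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One entry of a toy scan tlist: which base relation and column to copy. */
typedef struct TlistEntry
{
    int rel;     /* index into the array of EPQ tuples */
    int attno;   /* zero-based column within that tuple */
} TlistEntry;

/* A qual is a predicate over the reconstructed scan tuple. */
typedef bool (*QualFn) (const int *scan_tuple);

/*
 * Build the joined tuple per the tlist (step 2), then run the pushed-down
 * scan quals and join clauses against it (steps 3 and 4).
 * Returns true with scan_tuple filled in if the tuple still qualifies,
 * or false to signal "empty slot".
 */
bool
recheck_join(const int *const *epq_tuples,
             const TlistEntry *tlist, size_t ntlist,
             const QualFn *quals, size_t nquals,
             int *scan_tuple)
{
    assert(epq_tuples && tlist && scan_tuple);

    /* Step 2: place each referenced column into its scan-tuple position. */
    for (size_t i = 0; i < ntlist; i++)
        scan_tuple[i] = epq_tuples[tlist[i].rel][tlist[i].attno];

    /* Steps 3 and 4: any failing qual rejects the tuple. */
    for (size_t i = 0; i < nquals; i++)
    {
        if (!quals[i](scan_tuple))
            return false;
    }
    return true;
}

/* Example quals for a scan tuple laid out as (t1.a, t1.b, t2.x, t3.s). */
bool scan_qual_t1_b_positive(const int *st) { return st[1] > 0; }
bool join_qual_t1_a_eq_t2_x(const int *st) { return st[0] == st[2]; }
```

Note that the quals read the reconstructed scan tuple, not the base tuples,
which is why they can only run after step 2.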

Hmm, I guess this would work. But it still feels unnatural to me. It
feels like we haven't really pushed down the join. It's pushed down
except when there's an EPQ check, and then it's not. So we need a
whole alternate plan tree. With my proposal, we don't need that.

There is also some possible loss of efficiency with this approach.
Suppose that we have two tables ft1 and ft2 which are being joined,
and we push down the join. They are being joined on an integer
column, and the join needs to select several other columns as well.
However, ft1 and ft2 are very wide tables that also contain some text
columns. The query is like this:

SELECT localtab.a, ft1.p, ft2.p FROM localtab LEFT JOIN (ft1 JOIN ft2
ON ft1.x = ft2.x AND ft1.huge ~ 'stuff' AND ft2.huge2 ~ 'nonsense') ON
localtab.q = ft1.q;

If we refetch each row individually, we will need a wholerow image of
ft1 and ft2 that includes all columns, or at least ft1.huge and
ft2.huge2. If we just fetch a wholerow image of the join output, we
can exclude those. The only thing we need to recheck is that it's
still the case that localtab.q = ft1.q (because the value of
localtab.q might have changed).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#155

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Robert Haas (#154)

Re: Foreign join pushdown vs EvalPlanQual

On Mon, Oct 19, 2015 at 12:17 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

1. Fetch the EPQ slot of every base relation involved in this join.
In the case of ForeignScan, all the required tuples of the base relations
should already be filled in, because they are fetched beforehand by a
whole-row Var under early row locking, or by RefetchForeignRow under late
row locking.
In the case of CustomScan, it can call ExecProcNode() to generate the
first tuple even if it does not exist yet.
Anyway, I assume all the component tuples of this join can be fetched
using the existing EPQ slots, because they are owned by base relations.

2. The recheck callback fills up ss_ScanTupleSlot according to the
fdw_scan_tlist or custom_scan_tlist. The callback knows the best way
to reconstruct the joined tuple from the base relations' tuples fetched
in step 1.
For example, if the joined tuple consists of (t1.a, t1.b, t2.x, t3.s),
the callback picks up 't1.a' and 't1.b' from the tuple fetched from
the EPQ slot of t1, then puts these values into the 1st and 2nd columns.
It also picks up 't2.x' from the tuple fetched from the EPQ slot of
t2, then puts this value into the 3rd column. The same goes for 't3'.
At this point, ss_ScanTupleSlot is filled with the expected fields,
as if the join clauses were satisfied.

3. The recheck callback also checks the pushed-down qualifiers of the base
relations. Because expression nodes kept in fdw_exprs or
custom_exprs are initialized to reference ss_ScanTupleSlot at setrefs.c,
it is more reasonable to run ExecQual after step 2.
If one of the base-relation qualifiers evaluates to false,
the recheck callback returns an empty slot.

4. The recheck callback also checks the join clauses that join the
underlying base relations. For the same reason as in step 3, it is more
reasonable to execute ExecQual after step 2.
If one of the join clauses evaluates to false, the recheck returns
an empty slot.
Otherwise, it returns ss_ScanTupleSlot, and ExecScan then handles
any further processing.

Hmm, I guess this would work. But it still feels unnatural to me. It
feels like we haven't really pushed down the join. It's pushed down
except when there's an EPQ check, and then it's not. So we need a
whole alternate plan tree. With my proposal, we don't need that.

Even if we fetch whole rows from both sides, join pushdown is still doing
its job, because we receive fewer rows than with a local join over two
foreign scans. (If the planner works well, we can expect a join path that
increases the number of rows to be dropped.)

One downside of my proposal is the growth in width of individual rows.
It is a trade-off. The above approach requires no changes to the
existing EPQ infrastructure; thus, its implementation design is clear.
On the other hand, your approach will reduce traffic over the network;
however, it is still unclear how we integrate scanrelid==0 with the EPQ
infrastructure.

On the other hand, a custom scan sits on top of local scan nodes, so any
kind of ROW_MARK_* other than ROW_MARK_COPY can occur. I think the width
of the joined tuples is a relatively minor issue compared to the FDW case.
However, we cannot expect the fetched rows to be protected by the early
row-locking mechanism, so re-fetching rows and reconstructing the joined
tuple is more likely to be needed and has relatively higher priority.

There is also some possible loss of efficiency with this approach.
Suppose that we have two tables ft1 and ft2 which are being joined,
and we push down the join. They are being joined on an integer
column, and the join needs to select several other columns as well.
However, ft1 and ft2 are very wide tables that also contain some text
columns. The query is like this:

SELECT localtab.a, ft1.p, ft2.p FROM localtab LEFT JOIN (ft1 JOIN ft2
ON ft1.x = ft2.x AND ft1.huge ~ 'stuff' AND ft2.huge2 ~ 'nonsense') ON
localtab.q = ft1.q;

If we refetch each row individually, we will need a wholerow image of
ft1 and ft2 that includes all columns, or at least ft1.huge and
ft2.huge2. If we just fetch a wholerow image of the join output, we
can exclude those. The only thing we need to recheck is that it's
still the case that localtab.q = ft1.q (because the value of
localtab.q might have changed).

Isn't it possible to distinguish the whole-row Var references required by
the locking mechanism from the ones required by users?
(Does resjunk=true give us a hint?)

When a whole-row Var reference is required only for system-internal
purposes, it seems harmless to me to put dummy NULLs in the unreferenced
columns. Is it a feasible idea?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#156

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Robert Haas (#153)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/20 5:34, Robert Haas wrote:

On Mon, Oct 19, 2015 at 3:45 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

As Tom mentioned, just recomputing the original join tuple is not good
enough. We would need to rejoin the test tuples for the baserels even if
ROW_MARK_COPY is in use. Consider:

A=# BEGIN;
A=# UPDATE t SET a = a + 1 WHERE b = 1;
B=# SELECT * from t, ft1, ft2
WHERE t.a = ft1.a AND t.b = ft2.b AND ft1.c = ft2.c FOR UPDATE;
A=# COMMIT;

where the plan for the SELECT FOR UPDATE is

LockRows
  -> Nested Loop
       -> Seq Scan on t
       -> Foreign Scan on <ft1, ft2>
            Remote SQL: SELECT * FROM ft1 JOIN ft2 ON ft1.c = ft2.c
                        WHERE ft1.a = $1 AND ft2.b = $2

If an EPQ recheck is invoked by A's UPDATE, just recomputing the
original join tuple from the whole-row image, as you proposed, would output
an incorrect result in the EPQ recheck: once A's UPDATE has committed
successfully, the value of a in the updated version of the to-be-joined
tuple in t no longer matches the value ft1.a extracted from the whole-row
image. So I think we would need to rejoin the tuples populated from
the whole-row images for the baserels ft1 and ft2, by executing the
secondary plan with the new parameter values for a and b.

No. You just need to populate fdw_recheck_quals correctly, same as
for the scan case.

Yeah, I think we can probably do that for the case where a pushed-down
join clause is an inner-join one, but I'm not sure that we can do that
for the case where that clause is an outer-join one. Maybe I'm missing
something, though.

Best regards,
Etsuro Fujita


#157

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Etsuro Fujita (#156)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: Etsuro Fujita [mailto:fujita.etsuro@lab.ntt.co.jp]
Sent: Tuesday, October 20, 2015 1:11 PM
To: Robert Haas
Cc: Tom Lane; Kaigai Kouhei(海外浩平); Kyotaro HORIGUCHI;
pgsql-hackers@postgresql.org; Shigeru Hanada
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/10/20 5:34, Robert Haas wrote:

On Mon, Oct 19, 2015 at 3:45 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

As Tom mentioned, just recomputing the original join tuple is not good
enough. We would need to rejoin the test tuples for the baserels even if
ROW_MARK_COPY is in use. Consider:

A=# BEGIN;
A=# UPDATE t SET a = a + 1 WHERE b = 1;
B=# SELECT * from t, ft1, ft2
WHERE t.a = ft1.a AND t.b = ft2.b AND ft1.c = ft2.c FOR UPDATE;
A=# COMMIT;

where the plan for the SELECT FOR UPDATE is

LockRows
  -> Nested Loop
       -> Seq Scan on t
       -> Foreign Scan on <ft1, ft2>
            Remote SQL: SELECT * FROM ft1 JOIN ft2 ON ft1.c = ft2.c
                        WHERE ft1.a = $1 AND ft2.b = $2

If an EPQ recheck is invoked by A's UPDATE, just recomputing the
original join tuple from the whole-row image, as you proposed, would output
an incorrect result in the EPQ recheck: once A's UPDATE has committed
successfully, the value of a in the updated version of the to-be-joined
tuple in t no longer matches the value ft1.a extracted from the whole-row
image. So I think we would need to rejoin the tuples populated from
the whole-row images for the baserels ft1 and ft2, by executing the
secondary plan with the new parameter values for a and b.

No. You just need to populate fdw_recheck_quals correctly, same as
for the scan case.

Yeah, I think we can probably do that for the case where a pushed-down
join clause is an inner-join one, but I'm not sure that we can do that
for the case where that clause is an outer-join one. Maybe I'm missing
something, though.

Please check my message from yesterday. The non-nullable side of an outer
join is always visible regardless of the pushed-down join clause, as long
as it satisfies the pushed-down scan quals.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#158

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Kouhei Kaigai (#157)

Re: Foreign join pushdown vs EvalPlanQual

On Mon, Oct 19, 2015 at 3:45 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

As Tom mentioned, just recomputing the original join tuple is not good
enough. We would need to rejoin the test tuples for the baserels even if
ROW_MARK_COPY is in use. Consider:

A=# BEGIN;
A=# UPDATE t SET a = a + 1 WHERE b = 1;
B=# SELECT * from t, ft1, ft2
WHERE t.a = ft1.a AND t.b = ft2.b AND ft1.c = ft2.c FOR UPDATE;
A=# COMMIT;

where the plan for the SELECT FOR UPDATE is

LockRows
  -> Nested Loop
       -> Seq Scan on t
       -> Foreign Scan on <ft1, ft2>
            Remote SQL: SELECT * FROM ft1 JOIN ft2 ON ft1.c = ft2.c
                        WHERE ft1.a = $1 AND ft2.b = $2

If an EPQ recheck is invoked by A's UPDATE, just recomputing the
original join tuple from the whole-row image, as you proposed, would output
an incorrect result in the EPQ recheck: once A's UPDATE has committed
successfully, the value of a in the updated version of the to-be-joined
tuple in t no longer matches the value ft1.a extracted from the whole-row
image. So I think we would need to rejoin the tuples populated from
the whole-row images for the baserels ft1 and ft2, by executing the
secondary plan with the new parameter values for a and b.

Robert Haas wrote:

No. You just need to populate fdw_recheck_quals correctly, same as
for the scan case.

I wrote:

Yeah, I think we can probably do that for the case where a pushed-down
join clause is an inner-join one, but I'm not sure that we can do that
for the case where that clause is an outer-join one. Maybe I'm missing
something, though.

On 2015/10/20 15:42, Kouhei Kaigai wrote:

Please check my message from yesterday. The non-nullable side of an outer
join is always visible regardless of the pushed-down join clause, as long
as it satisfies the pushed-down scan quals.

Sorry, my explanation was not correct. (Needed to take in caffeine.)
What I'm concerned about is the following:

SELECT * FROM localtab JOIN (ft1 LEFT JOIN ft2 ON ft1.x = ft2.x) ON
localtab.id = ft1.id FOR UPDATE OF ft1

LockRows
  -> Nested Loop
       Join Filter: (localtab.id = ft1.id)
       -> Seq Scan on localtab
       -> Foreign Scan on <ft1, ft2>
            Remote SQL: SELECT * FROM ft1 LEFT JOIN ft2 ON ft1.x =
                        ft2.x FOR UPDATE OF ft1

Assume that ft1 performs late row locking. If an EPQ recheck were
invoked due to a concurrent transaction on the remote server that
changed only the value x of the ft1 tuple previously retrieved, then we
would have to generate a fake ft1/ft2-join tuple with nulls for ft2.
(Assume that the ft2 tuple previously retrieved was not a null tuple.)
However, I'm not sure how we can do that in ForeignRecheck; we can't
know, for example, which one is outer and which one is inner without an
alternative local join execution plan. Maybe I'm missing something, though.
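
The difficulty can be made concrete with a toy model in C. The sketch
below rejoins one refetched ft1 tuple with one ft2 tuple on ft1.x = ft2.x.
It produces the null-extended tuple only because the caller tells it that
the join is a LEFT JOIN with ft1 on the outer side, which is the
information a bare qual check does not have. ToyJoinType, ToyJoinResult and
rejoin are hypothetical names for illustration, not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum ToyJoinType { TOY_JOIN_INNER, TOY_JOIN_LEFT } ToyJoinType;

typedef struct ToyJoinResult
{
    bool emitted;      /* false: no joined tuple survives the recheck */
    int  ft1_x;
    bool ft2_is_null;  /* true: the ft2 side was null-extended */
    int  ft2_x;        /* meaningful only if !ft2_is_null */
} ToyJoinResult;

/*
 * Rejoin one ft1 tuple with one ft2 tuple on ft1.x = ft2.x.
 * ft1 is assumed to be the outer (non-nullable) side.
 */
ToyJoinResult
rejoin(ToyJoinType jointype, int ft1_x, int ft2_x)
{
    ToyJoinResult r = {false, ft1_x, false, 0};

    assert(jointype == TOY_JOIN_INNER || jointype == TOY_JOIN_LEFT);

    if (ft1_x == ft2_x)
    {
        /* Join clause still holds: emit the ordinary joined tuple. */
        r.emitted = true;
        r.ft2_x = ft2_x;
    }
    else if (jointype == TOY_JOIN_LEFT)
    {
        /* Outer side survives; the inner side becomes all-NULL. */
        r.emitted = true;
        r.ft2_is_null = true;
    }
    return r;
}
```

If ft1.x changes from 1 to 2 while ft2.x stays 1, an inner join drops the
tuple, whereas the left join must still emit ft1 with NULLs for ft2.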

Best regards,
Etsuro Fujita


#159

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Etsuro Fujita (#156)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/20 13:11, Etsuro Fujita wrote:

On 2015/10/20 5:34, Robert Haas wrote:

On Mon, Oct 19, 2015 at 3:45 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

As Tom mentioned, just recomputing the original join tuple is not good
enough. We would need to rejoin the test tuples for the baserels even if
ROW_MARK_COPY is in use. Consider:

A=# BEGIN;
A=# UPDATE t SET a = a + 1 WHERE b = 1;
B=# SELECT * from t, ft1, ft2
WHERE t.a = ft1.a AND t.b = ft2.b AND ft1.c = ft2.c FOR UPDATE;
A=# COMMIT;

where the plan for the SELECT FOR UPDATE is

LockRows
  -> Nested Loop
       -> Seq Scan on t
       -> Foreign Scan on <ft1, ft2>
            Remote SQL: SELECT * FROM ft1 JOIN ft2 ON ft1.c = ft2.c
                        WHERE ft1.a = $1 AND ft2.b = $2

If an EPQ recheck is invoked by A's UPDATE, just recomputing the
original join tuple from the whole-row image, as you proposed, would output
an incorrect result in the EPQ recheck: once A's UPDATE has committed
successfully, the value of a in the updated version of the to-be-joined
tuple in t no longer matches the value ft1.a extracted from the whole-row
image. So I think we would need to rejoin the tuples populated from
the whole-row images for the baserels ft1 and ft2, by executing the
secondary plan with the new parameter values for a and b.

No. You just need to populate fdw_recheck_quals correctly, same as
for the scan case.

Yeah, I think we can probably do that for the case where a pushed-down
join clause is an inner-join one, but I'm not sure that we can do that
for the case where that clause is an outer-join one. Maybe I'm missing
something, though.

As I said yesterday, that opinion of mine was completely wrong. Sorry for
the error. Let me explain a little bit more. I still think
that even if ROW_MARK_COPY is in use, we would need to locally rejoin
the tuples populated from the whole-row images for the foreign tables
involved in a remote join, using a secondary plan. Consider, e.g.,

SELECT localtab.*, ft2 from localtab, ft1, ft2
WHERE ft1.x = ft2.x AND ft1.y = localtab.y FOR UPDATE

In this case, since the output of the foreign join would not include any
ft1 columns, I don't think we could do the same thing as for the scan
case, even if fdw_recheck_quals were populated correctly. And I think we
would need to rejoin the tuples, using a local join execution plan,
which would have the parameterization for the to-be-pushed-down clause
ft1.y = localtab.y. Maybe I'm still missing something, though.

Best regards,
Etsuro Fujita


#160

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Etsuro Fujita (#159)

Re: Foreign join pushdown vs EvalPlanQual

-----Original Message-----
From: Etsuro Fujita [mailto:fujita.etsuro@lab.ntt.co.jp]
Sent: Wednesday, October 21, 2015 12:31 PM
To: Robert Haas
Cc: Tom Lane; Kaigai Kouhei(海外浩平); Kyotaro HORIGUCHI;
pgsql-hackers@postgresql.org; Shigeru Hanada
Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On 2015/10/20 13:11, Etsuro Fujita wrote:

On 2015/10/20 5:34, Robert Haas wrote:

On Mon, Oct 19, 2015 at 3:45 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

As Tom mentioned, just recomputing the original join tuple is not good
enough. We would need to rejoin the test tuples for the baserels even if
ROW_MARK_COPY is in use. Consider:

A=# BEGIN;
A=# UPDATE t SET a = a + 1 WHERE b = 1;
B=# SELECT * from t, ft1, ft2
WHERE t.a = ft1.a AND t.b = ft2.b AND ft1.c = ft2.c FOR UPDATE;
A=# COMMIT;

where the plan for the SELECT FOR UPDATE is

LockRows
  -> Nested Loop
       -> Seq Scan on t
       -> Foreign Scan on <ft1, ft2>
            Remote SQL: SELECT * FROM ft1 JOIN ft2 ON ft1.c = ft2.c
                        WHERE ft1.a = $1 AND ft2.b = $2

If an EPQ recheck is invoked by A's UPDATE, just recomputing the
original join tuple from the whole-row image, as you proposed, would output
an incorrect result in the EPQ recheck: once A's UPDATE has committed
successfully, the value of a in the updated version of the to-be-joined
tuple in t no longer matches the value ft1.a extracted from the whole-row
image. So I think we would need to rejoin the tuples populated from
the whole-row images for the baserels ft1 and ft2, by executing the
secondary plan with the new parameter values for a and b.

No. You just need to populate fdw_recheck_quals correctly, same as
for the scan case.

Yeah, I think we can probably do that for the case where a pushed-down
join clause is an inner-join one, but I'm not sure that we can do that
for the case where that clause is an outer-join one. Maybe I'm missing
something, though.

As I said yesterday, that opinion of mine was completely wrong. Sorry for
the error. Let me explain a little bit more. I still think
that even if ROW_MARK_COPY is in use, we would need to locally rejoin
the tuples populated from the whole-row images for the foreign tables
involved in a remote join, using a secondary plan. Consider, e.g.,

SELECT localtab.*, ft2 from localtab, ft1, ft2
WHERE ft1.x = ft2.x AND ft1.y = localtab.y FOR UPDATE

In this case, since the output of the foreign join would not include any
ft1 columns, I don't think we could do the same thing as for the scan
case, even if fdw_recheck_quals were populated correctly.

As an aside, could you introduce the reason why you think so? It is
significant point in discussion, if we want to reach the consensus.

It looks to me as though the explanation above mixes up the target list
of the user query with the target list of the remote query.
If the EPQ mechanism requires a joined tuple on ft1 and ft2, the FDW
driver can construct a remote query as follows:
SELECT ft2, ft1.y, ft1.x, ft2.x FROM ft1, ft2 WHERE ft1.x = ft2.x FOR UPDATE
Thus, fdw_scan_tlist has four target entries, but the latter two items are
resjunk=true because the ForeignScan node drops these columns by projection
when it returns a tuple to the upper node.
On the other hand, the joined tuple we're talking about in this context
is a tuple prior to projection, formed according to fdw_scan_tlist.
So it contains all the information needed to run scan/join qualifiers
against the joined tuple. It is not affected by the target list of the
user query.
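
A minimal sketch of the point about pre-projection tuples, written as a toy
Python model rather than PostgreSQL code. The names TargetEntry,
FDW_SCAN_TLIST, form_joined_tuple, recheck and project are hypothetical and
exist only for illustration; the column list follows the remote query quoted
above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetEntry:
    name: str      # column returned by the remote query, e.g. "ft1.y"
    resjunk: bool  # True: kept only for rechecks, dropped by projection


# Hypothetical scan target list for the remote join of ft1 and ft2.
FDW_SCAN_TLIST = [
    TargetEntry("ft2", resjunk=False),    # whole-row value the user query asks for
    TargetEntry("ft1.y", resjunk=False),  # needed by the local join clause
    TargetEntry("ft1.x", resjunk=True),
    TargetEntry("ft2.x", resjunk=True),
]


def form_joined_tuple(remote_row):
    """Build the pre-projection tuple: one value per scan-tlist entry."""
    return {te.name: remote_row[te.name] for te in FDW_SCAN_TLIST}


def recheck(joined, localtab_y):
    """Re-evaluate the join quals against the pre-projection tuple."""
    return joined["ft1.x"] == joined["ft2.x"] and joined["ft1.y"] == localtab_y


def project(joined):
    """Drop resjunk entries before handing the tuple to the upper node."""
    return {te.name: joined[te.name] for te in FDW_SCAN_TLIST if not te.resjunk}
```

The recheck runs before project(), so it can still see ft1.x and ft2.x even
though neither appears in what the upper node receives.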

Even though I think the approach with joined-tuple reconstruction is
a reasonable solution here, that is not a fair argument against
Robert's suggestion.

And I think we
would need to rejoin the tuples, using a local join execution plan,
which would have the parameterization for the to-be-pushed-down clause
ft1.y = localtab.y. I'm still missing something, though.

Also, please don't mix up "what we do" and "how we do".

It is a "what we do" discussion to decide which format of tuples shall be
returned to the core backend from the extension, because it determines the
role of the interface. If our consensus is to return a joined tuple, we
need to design the interface according to that consensus.

On the other hand, it is a "how we do" discussion whether we should
force every FDW/CSP extension to have an alternative plan or not.
Once we reach consensus in the "what we do" discussion, there are various
options to meet the requirement set by that consensus; however, we cannot
prioritize "how we do" without "what we do".

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#161

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Kouhei Kaigai (#160)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/21 13:34, Kouhei Kaigai wrote:

On 2015/10/20 13:11, Etsuro Fujita wrote:

On 2015/10/20 5:34, Robert Haas wrote:

No. You just need to populate fdw_recheck_quals correctly, same as
for the scan case.

As I said yesterday, that opinion of me is completely wrong. Sorry for
the incorrectness. Let me explain a little bit more. I still think
that even if ROW_MARK_COPY is in use, we would need to locally rejoin
the tuples populated from the whole-row images for the foreign tables
involved in a remote join, using a secondary plan. Consider eg,

SELECT localtab.*, ft2 from localtab, ft1, ft2
WHERE ft1.x = ft2.x AND ft1.y = localtab.y FOR UPDATE

In this case, since the output of the foreign join would not include any
ft1 columns, I don't think we could do the same thing as for the scan
case, even if populating fdw_recheck_quals correctly.

As an aside, could you introduce the reason why you think so? It is
significant point in discussion, if we want to reach the consensus.

On the other hands, the joined-tuple we're talking about in this context
is a tuple prior to projection; formed according to the fdw_scan_tlist.
So, it contains all the necessary information to run scan/join qualifiers
towards the joined-tuple. It is not affected by the target-list of user
query.

After research into the planner, I noticed that I was still wrong; IIUC,
the planner requires that the output of foreign join include the column
ft1.y even for that case. (I don't understand the reason why the
planner requires that.) So, as Robert mentioned, the clause ft1.y =
localtab.y could be rechecked during an EPQ recheck, if populating
fdw_recheck_quals correctly. Sorry again for the incorrectness.

Even though I think the approach with joined-tuple reconstruction is
reasonable solution here, it is not a fair reason to introduce disadvantage
of Robert's suggestion.

Agreed.

Also, please don't mix up "what we do" and "how we do".

It is "what we do" to discuss which format of tuples shall be returned
to the core backend from the extension, because it determines the role
of interface. If our consensus is to return a joined-tuple, we need to
design the interface according to the consensus.

On the other hands, it is "how we do" discussion whether we should
enforce all the FDW/CSP extension to have alternative plan, or not.
Once we got a consensus in "what we do" discussion, there are variable
options to solve the requirement by the consensus, however, we cannot
prioritize "how we do" without "what we do".

Agreed.

Best regards,
Etsuro Fujita

#162

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Kouhei Kaigai (#155)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/20 9:36, Kouhei Kaigai wrote:

Even if we fetch whole rows from both sides, join pushdown still pays off
because we receive fewer rows than with a local join plus two foreign
scans. (If the planner works well, we can expect that a join path that
increases the number of rows will be dropped.)

One downside of my proposition is the growth in width of individual rows.
It is a trade-off. The above approach requires no changes to the
existing EPQ infrastructure, so its implementation design is clear.
On the other hand, your approach will reduce traffic over the network;
however, it is still unclear how we integrate scanrelid==0 with the EPQ
infrastructure.

I agree with KaiGai-san that his proposition (or my proposition based on
secondary plans) is still a performance improvement over the current
implementation based on local joining plus early row locking, since it
wouldn't have to transfer useless data that didn't satisfy the join
conditions at all!

On the other hand, in the case of a custom scan that takes underlying local
scan nodes, any kind of ROW_MARK_* other than ROW_MARK_COPY can
occur. I think the width of the joined tuples is a relatively minor issue
compared with the FDW case. However, we cannot expect the fetched rows to be
protected by the early row-locking mechanism, so re-fetching rows and
reconstructing the joined tuple has relatively higher priority.

I see.

There is also some possible loss of efficiency with this approach.
Suppose that we have two tables ft1 and ft2 which are being joined,
and we push down the join. They are being joined on an integer
column, and the join needs to select several other columns as well.
However, ft1 and ft2 are very wide tables that also contain some text
columns. The query is like this:

SELECT localtab.a, ft1.p, ft2.p FROM localtab LEFT JOIN (ft1 JOIN ft2
ON ft1.x = ft2.x AND ft1.huge ~ 'stuff' AND ft2.huge2 ~ 'nonsense') ON
localtab.q = ft1.q;

If we refetch each row individually, we will need a wholerow image of
ft1 and ft2 that includes all columns, or at least ft1.huge and
ft2.huge2. If we just fetch a wholerow image of the join output, we
can exclude those. The only thing we need to recheck is that it's
still the case that localtab.q = ft1.q (because the value of
localtab.q might have changed).

As KaiGai-san mentioned above, what we need to discuss further about
Robert's proposition is how to integrate it into the existing EPQ
machinery. For example, when, where, and how should we refetch the
whole-row image of the join output in the case of late row locking? IMV,
that would require adding a new FDW API different from
RefetchForeignRow, say RefetchForeignJoinRow.

IMO, another benefit of the proposition from KaiGai-san (or me) is that
it could provide the whole functionality for row locking in remote
joins without an additional development burden on an FDW author; the
author only has to write GetForeignRowMarkType and RefetchForeignRow,
which I think is relatively easy. With that proposition, the use of
rowmark types such as ROW_MARK_SHARE or ROW_MARK_EXCLUSIVE for foreign
tables in remote joins would be quite inefficient, but the use of
ROW_MARK_REFERENCE instead of ROW_MARK_COPY would be an option for
workloads where EPQ rechecks are rarely invoked, because we just need to
transfer ctids, not whole-row images.
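
To make the ROW_MARK_REFERENCE versus ROW_MARK_COPY trade-off concrete, here
is a rough back-of-the-envelope model in Python. It is only a sketch: the
function name extra_bytes and every number passed to it are hypothetical,
not measurements, and it ignores protocol overhead and round trips.

```python
def extra_bytes(n_rows, row_width, rowid_width, n_rechecks, mark):
    """Estimate the extra bytes shipped for row marking (toy model).

    COPY:      every fetched row carries a whole-row image.
    REFERENCE: every fetched row carries only a row identifier, and a
               whole row is re-fetched once per EPQ recheck.
    """
    if mark == "COPY":
        return n_rows * row_width
    if mark == "REFERENCE":
        return n_rows * rowid_width + n_rechecks * row_width
    raise ValueError(f"unknown rowmark type: {mark}")


# Hypothetical workload: 1000 joined rows, 2000-byte rows, 6-byte row ids.
copy_cost = extra_bytes(1000, 2000, 6, 10, "COPY")
reference_cost = extra_bytes(1000, 2000, 6, 10, "REFERENCE")
```

Under these assumed numbers the reference approach ships far less data when
rechecks are rare, and the advantage disappears once most rows need a
recheck.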

Best regards,
Etsuro Fujita

#163

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Etsuro Fujita (#158)

Re: Foreign join pushdown vs EvalPlanQual

On Tue, Oct 20, 2015 at 12:39 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

Sorry, my explanation was not correct. (Needed to take in caffeine.) What
I'm concerned about is the following:

SELECT * FROM localtab JOIN (ft1 LEFT JOIN ft2 ON ft1.x = ft2.x) ON
localtab.id = ft1.id FOR UPDATE OF ft1

LockRows
-> Nested Loop
Join Filter: (localtab.id = ft1.id)
-> Seq Scan on localtab
-> Foreign Scan on <ft1, ft2>
Remote SQL: SELECT * FROM ft1 LEFT JOIN ft2 WHERE ft1.x = ft2.x
FOR UPDATE OF ft1

Assume that ft1 performs late row locking.

If the SQL includes "FOR UPDATE of ft1", then it clearly performs
early row locking. I assume you meant to omit that.

If an EPQ recheck was invoked
due to a concurrent transaction on the remote server that changed only the
value x of the ft1 tuple previously retrieved, then we would have to
generate a fake ft1/ft2-join tuple with nulls for ft2. (Assume that the ft2
tuple previously retrieved was not a null tuple.) However, I'm not sure how
we can do that in ForeignRecheck; we can't know for example, which one is
outer and which one is inner, without an alternative local join execution
plan. Maybe I'm missing something, though.

I would expect it to issue a new query like: SELECT * FROM ft1 LEFT
JOIN ft2 WHERE ft1.x = ft2.x AND ft1.tid = $0 AND ft2.tid = $1.

This should be significantly more efficient than fetching the base
rows from each of two tables with two separate queries.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#164

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Robert Haas (#163)

Re: Foreign join pushdown vs EvalPlanQual

On Tue, Oct 20, 2015 at 12:39 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

Sorry, my explanation was not correct. (Needed to take in caffeine.) What
I'm concerned about is the following:

SELECT * FROM localtab JOIN (ft1 LEFT JOIN ft2 ON ft1.x = ft2.x) ON
localtab.id = ft1.id FOR UPDATE OF ft1

LockRows
-> Nested Loop
Join Filter: (localtab.id = ft1.id)
-> Seq Scan on localtab
-> Foreign Scan on <ft1, ft2>
Remote SQL: SELECT * FROM ft1 LEFT JOIN ft2 WHERE ft1.x = ft2.x
FOR UPDATE OF ft1

Assume that ft1 performs late row locking.

If the SQL includes "FOR UPDATE of ft1", then it clearly performs
early row locking. I assume you meant to omit that.

If an EPQ recheck was invoked
due to a concurrent transaction on the remote server that changed only the
value x of the ft1 tuple previously retrieved, then we would have to
generate a fake ft1/ft2-join tuple with nulls for ft2. (Assume that the ft2
tuple previously retrieved was not a null tuple.) However, I'm not sure how
we can do that in ForeignRecheck; we can't know for example, which one is
outer and which one is inner, without an alternative local join execution
plan. Maybe I'm missing something, though.

I would expect it to issue a new query like: SELECT * FROM ft1 LEFT
JOIN ft2 WHERE ft1.x = ft2.x AND ft1.tid = $0 AND ft2.tid = $1.

This should be significantly more efficient than fetching the base
rows from each of two tables with two separate queries.

In this case, the EPQ slot to store the joined tuple is still
a challenge to be solved.

Is it possible to use one or any of the EPQ slots that are set up for
base relations but represented by a ForeignScan/CustomScan?
When a ForeignScan runs a remote join that involves three
base foreign tables (relid=2, 3, 5, for example),
no other code touches these slots. So it is safe even if we put
a joined tuple in the EPQ slots of the underlying base relations.

In this case, EPQ slots are initialized as below:

es_epqTuple[0] ... EPQ tuple of base relation (relid=1)
es_epqTuple[1] ... EPQ of the joined tuple (for relid=2, 3, 5)
es_epqTuple[2] ... EPQ of the joined tuple (for relid=2, 3, 5), copy of above
es_epqTuple[3] ... EPQ tuple of base relation (relid=4)
es_epqTuple[4] ... EPQ of the joined tuple (for relid=2, 3, 5), copy of above
es_epqTuple[5] ... EPQ tuple of base relation (relid=6)

Also, the FDW/CSP shall be responsible for returning a joined tuple
as the result of a whole-row reference to an underlying base relation.
(One other challenge is how to handle the case where the user explicitly
requested a whole-row reference...Hmm...)

Then, if the FDW/CSP is designed to use the previously joined
tuple rather than a local join, it can just return the tuple kept
in one of the EPQ slots for the underlying base relations.
If the FDW/CSP prefers a local join, it can behave as a local join
does: check the join condition and construct a joined tuple by itself
or through an alternative plan.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#165

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Kouhei Kaigai (#164)

Re: Foreign join pushdown vs EvalPlanQual

On Thu, Oct 29, 2015 at 6:05 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

In this case, the EPQ slot to store the joined tuple is still
a challenge to be solved.

Is it possible to use one or any of EPQ slots that are setup for
base relations but represented by ForeignScan/CustomScan?

Yes, I proposed that exact thing upthread.

In case when ForeignScan run a remote join that involves three
base foreign tables (relid=2, 3, 5 for example), for example,
no other code touches this slot. So, it is safe even if we put
a joined tuple on EPQ slots of underlying base relations.

In this case, EPQ slots are initialized as below:

es_epqTuple[0] ... EPQ tuple of base relation (relid=1)
es_epqTuple[1] ... EPQ of the joined tuple (for relid=2, 3, 5)
es_epqTuple[2] ... EPQ of the joined tuple (for relid=2, 3, 5), copy of above
es_epqTuple[3] ... EPQ tuple of base relation (relid=4)
es_epqTuple[4] ... EPQ of the joined tuple (for relid=2, 3, 5), copy of above
es_epqTuple[5] ... EPQ tuple of base relation (relid=6)

You don't really need to initialize them all. You can just initialize
es_epqTuple[1] and leave 2 and 4 unused.
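
A small Python sketch of that slot layout, assuming slots are indexed by
relid - 1 as in the listing above. The helper names init_epq_slots and
fetch_joined_tuple are hypothetical, and the choice of the lowest relid as
the shared slot is an assumption made for illustration.

```python
def init_epq_slots(num_rels, base_tuples, joined_relids, joined_tuple):
    """Return a list of EPQ slots indexed by relid - 1.

    base_tuples maps relid -> tuple for relations scanned individually.
    The joined tuple is stored once, in the slot of the lowest relid
    covered by the pushed-down join; the other covered slots stay unused.
    """
    slots = [None] * num_rels
    for relid, tup in base_tuples.items():
        slots[relid - 1] = tup
    slots[min(joined_relids) - 1] = joined_tuple
    return slots


def fetch_joined_tuple(slots, joined_relids):
    """Look the joined tuple up again through the same lowest relid."""
    return slots[min(joined_relids) - 1]
```

With relids 2, 3 and 5 joined remotely, only index 1 holds the joined tuple,
while indexes 2 and 4 remain empty.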

Then, if FDW/CSP is designed to utilize the preliminary joined
tuples rather than local join, it can just raise the tuple kept
in one of the EPQ slots for underlying base relations.
If FDW/CSP prefers local join, it can perform as like local join
doing; check join condition and construct a joined tuple by itself
or by alternative plan.

Right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#166

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Robert Haas (#163)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/28 6:04, Robert Haas wrote:

On Tue, Oct 20, 2015 at 12:39 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

Sorry, my explanation was not correct. (Needed to take in caffeine.) What
I'm concerned about is the following:

SELECT * FROM localtab JOIN (ft1 LEFT JOIN ft2 ON ft1.x = ft2.x) ON
localtab.id = ft1.id FOR UPDATE OF ft1

LockRows
-> Nested Loop
Join Filter: (localtab.id = ft1.id)
-> Seq Scan on localtab
-> Foreign Scan on <ft1, ft2>
Remote SQL: SELECT * FROM ft1 LEFT JOIN ft2 WHERE ft1.x = ft2.x
FOR UPDATE OF ft1

Assume that ft1 performs late row locking.

If the SQL includes "FOR UPDATE of ft1", then it clearly performs
early row locking. I assume you meant to omit that.

Right. Sorry for my mistake.

If an EPQ recheck was invoked
due to a concurrent transaction on the remote server that changed only the
value x of the ft1 tuple previously retrieved, then we would have to
generate a fake ft1/ft2-join tuple with nulls for ft2. (Assume that the ft2
tuple previously retrieved was not a null tuple.) However, I'm not sure how
we can do that in ForeignRecheck; we can't know for example, which one is
outer and which one is inner, without an alternative local join execution
plan. Maybe I'm missing something, though.

I would expect it to issue a new query like: SELECT * FROM ft1 LEFT
JOIN ft2 WHERE ft1.x = ft2.x AND ft1.tid = $0 AND ft2.tid = $1.

We assume here that ft1 uses late row locking, so I thought the above
SQL should include "FOR UPDATE of ft1". But I still don't think that
that is right; the SQL with "FOR UPDATE of ft1" wouldn't generate the
fake ft1/ft2-join tuple with nulls for ft2, as expected. The reason for
that is that the updated version of the ft1 tuple wouldn't satisfy the
ft1.tid = $0 condition in an EPQ recheck, because the ctid for the
updated version of the ft1 tuple has changed. (IIUC, I think that if we
use a TID scan for ft1, the SQL would generate the expected result,
because I think that the TID condition would be ignored in the EPQ
recheck, but I don't think it's guaranteed to use a TID scan for ft1.)
Maybe I'm missing something, though.
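
The ctid concern can be illustrated with a deliberately simplified Python
model. ToyHeap is hypothetical: real PostgreSQL keeps old row versions
around and links them to their successors, whereas this toy simply forgets
the old version. It only shows that a predicate on the previously fetched
ctid does not match the updated row version.

```python
class ToyHeap:
    """Toy heap: each UPDATE writes a new row version at a new ctid."""

    def __init__(self):
        self.versions = {}  # ctid -> row dict, current versions only
        self.next_ctid = 1

    def insert(self, row):
        ctid = self.next_ctid
        self.next_ctid += 1
        self.versions[ctid] = dict(row)
        return ctid

    def update(self, ctid, **changes):
        old = self.versions.pop(ctid)  # the old version is no longer current
        return self.insert({**old, **changes})

    def fetch_by_ctid(self, ctid):
        """Model of a remote query with the condition tid = $0."""
        return self.versions.get(ctid)
```

After a concurrent update, fetch_by_ctid with the old ctid returns nothing,
which is why the refetch query above would not produce the expected
null-extended join tuple.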

This should be significantly more efficient than fetching the base
rows from each of two tables with two separate queries.

Maybe we could fix the SQL, I have to admit, but I'm just wondering
(1) what would happen in the case where ft1 uses late row locking and
ft2 uses early row locking and (2) whether that would still be more
efficient than re-fetching only the base row from ft1.

My idea for improving the efficiency of the secondary-plan approach
that I proposed was that if we could parallelize re-fetching foreign
rows in ExecLockRows and EvalPlanQualFetchRowMarks, we would be able to
improve the efficiency not only for the case of performing a join of
foreign tables remotely but also for the case of performing the join locally.

Best regards,
Etsuro Fujita

#167

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Etsuro Fujita (#166)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/10/28 6:04, Robert Haas wrote:

On Tue, Oct 20, 2015 at 12:39 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

Sorry, my explanation was not correct. (Needed to take in caffeine.) What
I'm concerned about is the following:

SELECT * FROM localtab JOIN (ft1 LEFT JOIN ft2 ON ft1.x = ft2.x) ON
localtab.id = ft1.id FOR UPDATE OF ft1

LockRows
-> Nested Loop
Join Filter: (localtab.id = ft1.id)
-> Seq Scan on localtab
-> Foreign Scan on <ft1, ft2>
Remote SQL: SELECT * FROM ft1 LEFT JOIN ft2 WHERE ft1.x = ft2.x
FOR UPDATE OF ft1

Assume that ft1 performs late row locking.

If the SQL includes "FOR UPDATE of ft1", then it clearly performs
early row locking. I assume you meant to omit that.

Right. Sorry for my mistake.

If an EPQ recheck was invoked
due to a concurrent transaction on the remote server that changed only the
value x of the ft1 tuple previously retrieved, then we would have to
generate a fake ft1/ft2-join tuple with nulls for ft2. (Assume that the ft2
tuple previously retrieved was not a null tuple.) However, I'm not sure how
we can do that in ForeignRecheck; we can't know for example, which one is
outer and which one is inner, without an alternative local join execution
plan. Maybe I'm missing something, though.

I would expect it to issue a new query like: SELECT * FROM ft1 LEFT
JOIN ft2 WHERE ft1.x = ft2.x AND ft1.tid = $0 AND ft2.tid = $1.

We assume here that ft1 uses late row locking, so I thought the above
SQL should include "FOR UPDATE of ft1". But I still don't think that
that is right; the SQL with "FOR UPDATE of ft1" wouldn't generate the
fake ft1/ft2-join tuple with nulls for ft2, as expected. The reason for
that is that the updated version of the ft1 tuple wouldn't satisfy the
ft1.tid = $0 condition in an EPQ recheck, because the ctid for the
updated version of the ft1 tuple has changed. (IIUC, I think that if we
use a TID scan for ft1, the SQL would generate the expected result,
because I think that the TID condition would be ignored in the EPQ
recheck, but I don't think it's guaranteed to use a TID scan for ft1.)
Maybe I'm missing something, though.

It looks to me that we should not use the ctid system column to identify
a remote row when postgres_fdw tries to support late row locking.

The documentation says:
http://www.postgresql.org/docs/devel/static/fdw-callbacks.html#FDW-CALLBACKS-UPDATE

UPDATE and DELETE operations are performed against rows previously
fetched by the table-scanning functions. The FDW may need extra information,
such as a row ID or the values of primary-key columns, to ensure that it can
identify the exact row to update or delete

The "rowid" should not change between the time it is fetched from the
remote side and the time the row is actually updated, deleted or locked,
for correct identification. If ctid is used for this purpose, it is safe
only when the remote row is locked at the time it is fetched; that is
exactly early row locking behavior, isn't it?

This should be significantly more efficient than fetching the base
rows from each of two tables with two separate queries.

Maybe we could fix the SQL, I have to admit, but I'm just wondering
(1) what would happen in the case where ft1 uses late row locking and
ft2 uses early row locking and (2) whether that would still be more
efficient than re-fetching only the base row from ft1.

That should be a decision for the FDW driver. It is hard to imagine an
FDW driver mixing early and late locking policies within the same remote
join query. Do you really want to support such a mysterious implementation?

Or do you expect every FDW driver to be forced to return a joined tuple
in the remote join case? That is different from my idea; fetching a joined
tuple at once shall be an extra optimization option if the FDW can do it,
but not a requirement. So, if an FDW driver does not support this optimized
behavior, it can fetch the two base tables and then run a local alternative
join (or something else).

What I thought to improve the efficiency in the secondary-plan approach
that I proposed was that if we could parallelize re-fetching foreign
rows in ExecLockRows and EvalPlanQualFetchRowMarks, we would be able to
improve the efficiency not only for the case when performing a join of
foreign tables remotely but for the case when performing the join locally.

Parallelism is not a magic bullet...

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#168

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Kouhei Kaigai (#167)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/11/04 17:10, Kouhei Kaigai wrote:

On 2015/10/28 6:04, Robert Haas wrote:

On Tue, Oct 20, 2015 at 12:39 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

Sorry, my explanation was not correct. (Needed to take in caffeine.) What
I'm concerned about is the following:

SELECT * FROM localtab JOIN (ft1 LEFT JOIN ft2 ON ft1.x = ft2.x) ON
localtab.id = ft1.id FOR UPDATE OF ft1

LockRows
-> Nested Loop
Join Filter: (localtab.id = ft1.id)
-> Seq Scan on localtab
-> Foreign Scan on <ft1, ft2>
Remote SQL: SELECT * FROM ft1 LEFT JOIN ft2 WHERE ft1.x = ft2.x
FOR UPDATE OF ft1

Assume that ft1 performs late row locking.

If the SQL includes "FOR UPDATE of ft1", then it clearly performs
early row locking. I assume you meant to omit that.

If an EPQ recheck was invoked
due to a concurrent transaction on the remote server that changed only the
value x of the ft1 tuple previously retrieved, then we would have to
generate a fake ft1/ft2-join tuple with nulls for ft2. (Assume that the ft2
tuple previously retrieved was not a null tuple.) However, I'm not sure how
we can do that in ForeignRecheck; we can't know for example, which one is
outer and which one is inner, without an alternative local join execution
plan. Maybe I'm missing something, though.

I would expect it to issue a new query like: SELECT * FROM ft1 LEFT
JOIN ft2 WHERE ft1.x = ft2.x AND ft1.tid = $0 AND ft2.tid = $1.

We assume here that ft1 uses late row locking, so I thought the above
SQL should include "FOR UPDATE of ft1". But I still don't think that
that is right; the SQL with "FOR UPDATE of ft1" wouldn't generate the
fake ft1/ft2-join tuple with nulls for ft2, as expected. The reason for
that is that the updated version of the ft1 tuple wouldn't satisfy the
ft1.tid = $0 condition in an EPQ recheck, because the ctid for the
updated version of the ft1 tuple has changed. (IIUC, I think that if we
use a TID scan for ft1, the SQL would generate the expected result,
because I think that the TID condition would be ignored in the EPQ
recheck, but I don't think it's guaranteed to use a TID scan for ft1.)
Maybe I'm missing something, though.

It looks to me, we should not use ctid system column to identify remote
row when postgres_fdw tries to support late row locking.

The documentation says:
http://www.postgresql.org/docs/devel/static/fdw-callbacks.html#FDW-CALLBACKS-UPDATE

UPDATE and DELETE operations are performed against rows previously
fetched by the table-scanning functions. The FDW may need extra information,
such as a row ID or the values of primary-key columns, to ensure that it can
identify the exact row to update or delete

The "rowid" should not be changed once it is fetched from the remote side
until it is actually updated, deleted or locked, for correct identification.
If ctid is used for this purpose, it is safe only when remote row is locked
when it is fetched - it is exactly early row locking behavior, isn't it?

Yeah, we should use early row locking for a target foreign table in
UPDATE/DELETE.

In case of SELECT FOR UPDATE, I think we are allowed to use ctid to
identify target rows for late row locking, but I think the above SQL
should be changed to something like this:

SELECT * FROM (SELECT * FROM ft1 WHERE ft1.tid = $0 FOR UPDATE) ss1 LEFT
JOIN (SELECT * FROM ft2 WHERE ft2.tid = $1) ss2 ON ss1.x = ss2.x
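
The left-join behavior expected from the rewritten query can be sketched in
Python as follows. This is a toy model: the function left_join_refetched and
the ft2 column names are hypothetical, and each subquery is assumed to
return at most one row.

```python
FT2_COLUMNS = ("x", "p")  # hypothetical column names for ft2


def left_join_refetched(ss1_row, ss2_row):
    """Join one refetched ft1 row (ss1) with one refetched ft2 row (ss2).

    ss1 is the preserved side of the LEFT JOIN, so it always survives.
    The ft2 columns become None when ss1.x = ss2.x does not hold.
    """
    if ss1_row is None:
        return None  # nothing on the preserved side, so no output row
    joined = {f"ft1.{col}": val for col, val in ss1_row.items()}
    matched = (
        ss2_row is not None
        and ss1_row["x"] is not None  # SQL: NULL = NULL is not true
        and ss1_row["x"] == ss2_row["x"]
    )
    for col in FT2_COLUMNS:
        joined[f"ft2.{col}"] = ss2_row[col] if matched else None
    return joined
```

When a concurrent update has changed ft1.x, the join clause fails and the
result is the ft1 row extended with nulls for ft2, which is the "fake"
join tuple discussed earlier in the thread.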

This should be significantly more efficient than fetching the base
rows from each of two tables with two separate queries.

Maybe we could fix the SQL, I have to admit, but I'm just wondering
(1) what would happen in the case where ft1 uses late row locking and
ft2 uses early row locking and (2) whether that would still be more
efficient than re-fetching only the base row from ft1.

It should be decision by FDW driver. It is not easy to estimate a certain
FDW driver mixes up early and late locking policy within a same remote join
query. Do you really want to support such a mysterious implementation?

Yeah, the reason for that is because GetForeignRowMarkType allows that.

Or, do you expect all the FDW driver is enforced to return a joined tuple
if remote join case?

No. That wouldn't make sense if at least one component table involved
in a foreign join uses the rowmark type other than ROW_MARK_COPY.

It is different from my idea; it shall be an extra
optimization option if FDW can fetch a joined tuple at once, but not always.
So, if FDW driver does not support this optimal behavior, your driver can
fetch two base tables then run local alternative join (or something other).

OK, so if we all agree that the joined-tuple optimization is just an
option for the case where all the component tables use ROW_MARK_COPY,
I'd propose to leave that for 9.6.

Best regards,
Etsuro Fujita

#169

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 10 years ago

In reply to: Etsuro Fujita (#168)

Re: Foreign join pushdown vs EvalPlanQual

Hi, I've caught up again.

OK, so if we all agree that the joined-tuple optimization is just an
option for the case where all the component tables use ROW_MARK_COPY,
I'd propose to leave that for 9.6.

I still think that ExecScan is called under EPQ recheck without
an EPQ tuple for the *scan*.

The ForeignScan can be generated for a join and underlying
foreign scans, and such an execution node returns what the core
doesn't expect from any scan node. This is what I think is the
root cause of this problem.

So, as a third way, I propose to resurrect the abandoned
ForeignJoinState, which seems to fit the unearthed requirements. The FDW
returns a ForeignJoinPath, not a ForeignScanPath, and finally it
becomes a ForeignJoinState, which is handled as a join node without
doubt.

What do you think about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#170

Kouhei Kaigai

kaigai@ak.jp.nec.com

about 10 years ago

In reply to: Kyotaro HORIGUCHI (#169)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/11/05 10:02, Kyotaro HORIGUCHI wrote:

Hi, I've caught up again.

OK, so if we all agree that the joined-tuple optimization is just an
option for the case where all the component tables use ROW_MARK_COPY,
I'd propose to leave that for 9.6.

I still think that ExecScan is called under EPQ recheck without
an EPQ tuple for the *scan*.

The ForeignScan can be generated for a join and underlying
foreign scans, and such an execution node returns what the core
doesn't expect from any scan node. This is what I think is the
root cause of this problem.

So, as a third way, I propose to resurrect the abandoned
ForeignJoinState, which seems to fit the unearthed requirements. The FDW
returns a ForeignJoinPath, not a ForeignScanPath, and finally it
becomes a ForeignJoinState, which is handled as a join node without
doubt.

What do you think about this?

Apart from the EPQ issues, it is fundamentally impossible to reflect
the remote join tree on the local side, because the remote server runs
the partial join in whatever way it considers best, or in an arbitrary way.
If this ForeignJoinState just has a compatible join sub-tree, what
is the difference from the alternative local join sub-plan?

Even if we have another node, the role of the FDW driver is unchanged.
It eventually needs to do the following:
1. Recheck scan-qualifier of base foreign table
2. Recheck join-clause of remote joins
3. Reconstruct a joined tuple
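
The three steps listed above can be sketched as a toy Python function. It
covers only an inner join of two foreign tables, and the name
recheck_remote_join and its arguments are hypothetical, not part of any FDW
API.

```python
def recheck_remote_join(ft1_row, ft2_row, scan_quals, join_clause):
    """Return the reconstructed joined tuple, or None if the recheck fails.

    scan_quals maps a relation name to a predicate on that relation's row;
    join_clause is a predicate over (ft1_row, ft2_row).
    """
    # 1. Recheck the scan qualifiers of each base foreign table.
    if not scan_quals["ft1"](ft1_row) or not scan_quals["ft2"](ft2_row):
        return None
    # 2. Recheck the join clause that was pushed down to the remote side.
    if not join_clause(ft1_row, ft2_row):
        return None
    # 3. Reconstruct the joined tuple from the base rows.
    joined = {f"ft1.{col}": val for col, val in ft1_row.items()}
    joined.update({f"ft2.{col}": val for col, val in ft2_row.items()})
    return joined
```

Whether these steps live in a ForeignScan callback or in a separate join
node, the same work has to be done somewhere.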

Let me try to guess your intention...
You are saying that a ForeignScan with scanrelid==0 is not actually a scan,
so it is problematic to always call ExecScan from ExecForeignScan.
Thus, a separate ForeignJoin node should be defined.
Right?

In the case of scanrelid==0, it behaves like a scan on a pseudo relation
whose record type is defined by fdw_scan_tlist. The rows generated
by this node consist of rows from the underlying base relations.
A significant point is that the FDW driver is responsible for generating
the rows according to fdw_scan_tlist. Once the FDW driver generates rows,
ExecScan() runs the remaining tasks: evaluation of host clauses (although
it is hard to imagine that a remote join including a host clause has a
cheaper cost than the alternatives) and projection.

One thing I can agree with is that ForeignScan is forced to use ExecScan,
so some FDW drivers may be concerned about this hard-wired logic.
If we try to decouple ForeignScan from ExecScan, I would suggest
revising ExecForeignScan to just invoke a callback; then the FDW driver
can choose whether ExecScan is the best fit or not.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#171

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

about 10 years ago

In reply to: Etsuro Fujita (#168)

Re: Foreign join pushdown vs EvalPlanQual

On 2015/11/04 18:50, Etsuro Fujita wrote:

On 2015/11/04 17:10, Kouhei Kaigai wrote:

On 2015/10/28 6:04, Robert Haas wrote:

On Tue, Oct 20, 2015 at 12:39 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

Sorry, my explanation was not correct. (Needed to take in caffeine.)
What I'm concerned about is the following:

SELECT * FROM localtab JOIN (ft1 LEFT JOIN ft2 ON ft1.x = ft2.x) ON
localtab.id = ft1.id FOR UPDATE OF ft1

If an EPQ recheck was invoked due to a concurrent transaction on the
remote server that changed only the value x of the ft1 tuple previously
retrieved, then we would have to generate a fake ft1/ft2-join tuple with
nulls for ft2. (Assume that the ft2 tuple previously retrieved was not a
null tuple.) However, I'm not sure how we can do that in ForeignRecheck;
we can't know, for example, which one is outer and which one is inner
without an alternative local join execution plan. Maybe I'm missing
something, though.

I would expect it to issue a new query like: SELECT * FROM ft1 LEFT
JOIN ft2 WHERE ft1.x = ft2.x AND ft1.tid = $0 AND ft2.tid = $1.

We assume here that ft1 uses late row locking, so I thought the above
SQL should include "FOR UPDATE of ft1". But I still don't think that
that is right; the SQL with "FOR UPDATE of ft1" wouldn't generate the
fake ft1/ft2-join tuple with nulls for ft2, as expected. The reason for
that is that the updated version of the ft1 tuple wouldn't satisfy the
ft1.tid = $0 condition in an EPQ recheck, because the ctid for the
updated version of the ft1 tuple has changed. (IIUC, I think that if we
use a TID scan for ft1, the SQL would generate the expected result,
because I think that the TID condition would be ignored in the EPQ
recheck, but I don't think it's guaranteed to use a TID scan for ft1.)
Maybe I'm missing something, though.

It looks to me that we should not use the ctid system column to identify
a remote row when postgres_fdw tries to support late row locking.

For correct identification, the "rowid" must not change from the time it
is fetched from the remote side until the row is actually updated,
deleted or locked.
If ctid is used for this purpose, it is safe only when the remote row is
locked at the time it is fetched, which is exactly the early row locking
behavior, isn't it?

In case of SELECT FOR UPDATE, I think we are allowed to use ctid to
identify target rows for late row locking, but I think the above SQL
should be changed to something like this:

SELECT * FROM (SELECT * FROM ft1 WHERE ft1.tid = $0 FOR UPDATE) ss1 LEFT
JOIN (SELECT * FROM ft2 WHERE ft2.tid = $1) ss2 ON ss1.x = ss2.x

I noticed that the modified SQL was still wrong; ss1 would produce no
tuple if, e.g., a sequential scan were used for ss1, as discussed above.
Sheesh, where is my brain?

I still think we are allowed to do that, but what is the right SQL for
that? In the current implementation of postgres_fdw, we need not take
into consideration that what was fetched was an updated version of the
tuple rather than the same version previously obtained, since that
always uses at least REPEATABLE READ in the remote session. But
otherwise it would be possible that what was fetched was an updated
version of the tuple, having a different ctid value, which wouldn't
satisfy the condition like "ft1.tid = $0" in ss1 any more.
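
The point about ctid can be illustrated with a toy model. This is a sketch with a deliberately simplified visibility rule (it ignores in-progress and aborted transactions), and every name in it (MockVersion, mock_visible, mock_fetch_by_ctid) is hypothetical rather than anything in PostgreSQL. An update leaves the old version at its old ctid and creates the new version at a different one, so a lookup by the old ctid succeeds only under a snapshot that predates the update.

```c
#include <assert.h>
#include <stddef.h>

/* Mock row version; xmax == 0 means "not yet replaced or deleted". */
typedef struct MockVersion {
    int ctid;   /* mock physical location of this version */
    int x;      /* the column value */
    int xmin;   /* mock id of the transaction that created the version */
    int xmax;   /* mock id of the transaction that replaced it, or 0 */
} MockVersion;

/*
 * Simplified visibility: a snapshot sees the effects of every transaction
 * whose id is <= snapshot, and nothing later.
 */
static int mock_visible(const MockVersion *v, int snapshot)
{
    return v->xmin <= snapshot && (v->xmax == 0 || v->xmax > snapshot);
}

/* Emulates "WHERE ctid = $0": return the visible version at ctid, if any. */
static const MockVersion *
mock_fetch_by_ctid(const MockVersion *heap, size_t n, int ctid, int snapshot)
{
    assert(heap != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (heap[i].ctid == ctid && mock_visible(&heap[i], snapshot))
            return &heap[i];
    }
    return NULL;
}
```

With a heap holding an old version at ctid 1 (replaced by transaction 5) and its successor at ctid 2, a lookup of ctid 1 under snapshot 3 still finds the old version, which corresponds to the REPEATABLE READ case, while the same lookup under snapshot 6 finds nothing.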

Best regards,
Etsuro Fujita


#172

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 10 years ago

In reply to: Kouhei Kaigai (#170)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

At Thu, 5 Nov 2015 01:58:00 +0000, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote in <9A28C8860F777E439AA12E8AEA7694F80116284C@BPXM15GP.gisp.nec.co.jp>

So, as a third way, I propose to resurrect the abandoned
ForeignJoinState, which seems to fit the requirements unearthed
here. The FDW returns a ForeignJoinPath, not a ForeignScanPath,
which finally becomes a ForeignJoinState and is unambiguously
handled as a join node.

What do you think about this?

Apart from the EPQ issues, it is fundamentally impossible to reflect
the remote join tree on the local side, because the remote server runs
its part of the join in whatever way it finds best, or arbitrarily.
If this ForeignJoinState merely has a compatible join sub-tree, how
does it differ from the alternative local join sub-plan?

I think the ForeignJoinState wouldn't have subnodes and might not
differ in structure from ForeignScanState. Its
significant difference from ForeignScanState would be that the
core can properly handle what the node returns as a joined
tuple in the ordinary way. The executor would no longer call ExecScan
for joined tuples.

Even if we have another node, the role of the FDW driver is unchanged.
It eventually needs to do the following:
1. Recheck scan-qualifier of base foreign table
2. Recheck join-clause of remote joins
3. Reconstruct a joined tuple

Yes, the most significant point of this proposal is not on the FDW
side but on the core side.

Let me try to guess your intention.
You are saying that a ForeignScan with scanrelid==0 is not actually a scan,
so it is problematic for ExecForeignScan to always call ExecScan,
and therefore a separate ForeignJoin node should be defined.
Right?

Definitely.

In the case of scanrelid==0, it behaves like a scan on a pseudo relation
whose record type is defined by fdw_scan_tlist. The rows generated
by this node consist of rows from the underlying base relations.
A significant point is that the FDW driver is responsible for generating
the rows according to fdw_scan_tlist. Once the FDW driver has generated
a row, ExecScan() runs the remaining tasks: evaluation of host clauses
(although it is hard to imagine a remote join with host clauses being
cheaper than the alternatives) and projection.

Agreed. The role of FDW won't be changed by introducing
ForeignJoin.

One thing I can agree with is that ForeignScan is forced to use ExecScan,
so some FDW drivers may be concerned about this hard-wired logic.
If we try to decouple ForeignScan from ExecScan, I would
suggest revising ExecForeignScan to just invoke a callback; then the
FDW driver can choose whether or not ExecScan is the best fit.

Agreed. Calling ExecScan unconditionally from ForeignScan is the
cause of the root(?) cause I mentioned. Since there would be no
difference in data structure between Foreign(Join&Node), calling
fdwroutine->ExecForeignScan() or something similar instead of ExecScan()
from ExecForeignScan could be an alternative, and the most promising
solution for all the problems in focus now.
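
For readers who want to see the shape of this idea in isolation, here is a minimal standalone sketch of an optional callback with a default fallback. It does not use the real executor types; MockSlot, MockFdwRoutine, MockScanState and the mock_* functions are invented for the example, and the integer in the slot only marks which path ran.

```c
#include <assert.h>
#include <stddef.h>

typedef struct MockSlot { int value; } MockSlot;

struct MockScanState;
typedef MockSlot *(*MockExec_function) (struct MockScanState *node);

/* Mock callback table: the Exec hook is optional and may be left NULL. */
typedef struct MockFdwRoutine {
    MockExec_function ExecForeignScan;
} MockFdwRoutine;

typedef struct MockScanState {
    const MockFdwRoutine *fdwroutine;
    MockSlot              slot;
} MockScanState;

/* Stand-in for the generic ExecScan-based path (marks the slot with 1). */
static MockSlot *mock_default_exec(MockScanState *node)
{
    node->slot.value = 1;
    return &node->slot;
}

/* Stand-in for an FDW that supplies its own body (marks the slot with 2). */
static MockSlot *mock_fdw_join_exec(MockScanState *node)
{
    node->slot.value = 2;
    return &node->slot;
}

/* Dispatch: use the FDW's body if it set one, else the default path. */
static MockSlot *mock_exec_foreign_scan(MockScanState *node)
{
    assert(node != NULL && node->fdwroutine != NULL);
    if (node->fdwroutine->ExecForeignScan)
        return node->fdwroutine->ExecForeignScan(node);
    return mock_default_exec(node);
}
```

An FDW that does nothing special leaves the hook NULL and keeps today's behavior; only an FDW that returns joined tuples needs to fill it in.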

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#173

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 10 years ago

In reply to: Kyotaro HORIGUCHI (#172)

1 attachment(s)

Re: Foreign join pushdown vs EvalPlanQual

Hello,

The attached small patch is what I have in mind now.

fdwroutine->ExecForeignScan may be unset if the FDW does nothing
special. And all the FDW routine needs is the node.

Subject: [PATCH] Allow substitute ExecScan body for ExecForeignScan

A ForeignScan node may return a joined tuple. Such a joined tuple cannot be
handled properly by ExecScan during an EPQ recheck. This patch allows
FDWs to give such tuples special treatment.

regards,

One thing I can agree with is that ForeignScan is forced to use ExecScan,
so some FDW drivers may be concerned about this hard-wired logic.
If we try to decouple ForeignScan from ExecScan, I would
suggest revising ExecForeignScan to just invoke a callback; then the
FDW driver can choose whether or not ExecScan is the best fit.

Agreed. Calling ExecScan unconditionally from ForeignScan is the
cause of the root(?) cause I mentioned. Since there would be no
difference in data structure between Foreign(Join&Node), calling
fdwroutine->ExecForeignScan() or something similar instead of ExecScan()
from ExecForeignScan could be an alternative, and the most promising
solution for all the problems in focus now.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-substitute-ExecScan-body-for-ExecForignScan.patch (text/x-patch; charset=us-ascii)

From cddbb29bf09e33af38bc7690d1b78f4e20f363b3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 6 Nov 2015 13:23:55 +0900
Subject: [PATCH] Allow substitute ExecScan body for ExecForeignScan

A ForeignScan node may return a joined tuple. Such a joined tuple cannot be
handled properly by ExecScan during an EPQ recheck. This patch allows
FDWs to give such tuples special treatment.
---
 src/backend/executor/nodeForeignscan.c | 3 +++
 src/include/foreign/fdwapi.h           | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 6165e4a..f43a50b 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -100,6 +100,9 @@ ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
 TupleTableSlot *
 ExecForeignScan(ForeignScanState *node)
 {
+	if (node->fdwroutine->ExecForeignScan)
+		return node->fdwroutine->ExecForeignScan(node);
+
 	return ExecScan((ScanState *) node,
 					(ExecScanAccessMtd) ForeignNext,
 					(ExecScanRecheckMtd) ForeignRecheck);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 69b48b4..564898d 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -41,6 +41,8 @@ typedef ForeignScan *(*GetForeignPlan_function) (PlannerInfo *root,
 typedef void (*BeginForeignScan_function) (ForeignScanState *node,
 													   int eflags);
 
+typedef TupleTableSlot *(*ExecForeignScan_function) (ForeignScanState *node);
+
 typedef TupleTableSlot *(*IterateForeignScan_function) (ForeignScanState *node);
 
 typedef void (*ReScanForeignScan_function) (ForeignScanState *node);
@@ -137,6 +139,7 @@ typedef struct FdwRoutine
 	GetForeignPaths_function GetForeignPaths;
 	GetForeignPlan_function GetForeignPlan;
 	BeginForeignScan_function BeginForeignScan;
+	ExecForeignScan_function ExecForeignScan;
 	IterateForeignScan_function IterateForeignScan;
 	ReScanForeignScan_function ReScanForeignScan;
 	EndForeignScan_function EndForeignScan;
-- 
1.8.3.1