Bypassing cursors in postgres_fdw to enable parallel plans

Started by Rafia Sabihover 1 year ago38 messageshackers

rafia.pghackers@gmail.com

over 1 year ago

Hello hackers,

At present, in postgres_fdw, if a query which is using a parallel plan is
fired from the remote end fails to use the parallel plan locally because of
the presence of CURSORS. Consider the following example,
Local server,
Table:
create table t ( i int, j int, k text);
insert into t values(generate_series(1,10000), generate_series(1, 10000),
'check_this_out');
Query
select * from t where i > 1000;
Query plan
Gather (cost=0.00..116.08 rows=9000 width=23)
Workers Planned: 2
-> Parallel Seq Scan on t (cost=0.00..116.08 rows=3750 width=23)
Filter: (i > 1000)

Foreign server
create extension postgres_fdw;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS
(host '127.0.0.1', port '5432', dbname 'postgres');
CREATE USER MAPPING FOR user1 SERVER foreign_server OPTIONS (user 'user1');
Table
CREATE FOREIGN TABLE foreign_table ( i int, j int, k text ) SERVER
foreign_server OPTIONS (schema_name 'public', table_name 't');
Query
select * from t where i > 1000;
Query plan at the local server
Seq Scan on t (cost=0.00..189.00 rows=9000 width=23)
Filter: (i > 1000)

I have used auto_explain extension to get the query plans at the local
server and also following settings in .conf to force the parallel plans for
the purpose of demonstration --
min_parallel_table_scan_size = 0
parallel_tuple_cost= 0
parallel_setup_cost = 0

with the patch:
set postgres_fdw.use_cursor = false;
Query plan at the local server
Gather (cost=0.00..116.08 rows=9000 width=23)
Workers Planned: 2
-> Parallel Seq Scan on t (cost=0.00..116.08 rows=3750 width=23)
Filter: (i > 1000)

Now, to overcome this limitation, I have worked on this idea (suggested by
my colleague Bernd Helmle) of bypassing the cursors. The way it works is as
follows,
there is a new GUC introduced postgres_fdw.use_cursor, which when unset
uses the mode without the cursor. Now, it uses PQsetChunkedRowsMode in
create_cursor when non-cursor mode is used. The size of the chunk is the
same as the fetch_size. Now in fetch_more_data, when non-cursor mode is
used, pgfdw_get_next_result is used to get the chunk in PGresult and
processed in the same manner as before.

Now, the issue comes when there are simultaneous queries, which is the case
with the join queries where all the tables involved in the join are at the
local server. Because in that case we have multiple cursors opened at the
same time and without a cursor mechanism we do not have any information or
any other structure to know what to fetch from which query. To handle that
case, we have a flag only_query, which is unset as soon as we have assigned
the cursor_number >= 2, in postgresBeginForeignScan. Now, in fetch_more
data, when we find out that only_query is unset, then we fetch all the data
for the query and store it in a Tuplestore. These tuples are then
transferred to the fsstate->tuples and then processed as usual.

So yes there is a performance drawback in the case of simultaneous queries,
however, the ability to use parallel plans is really an added advantage for
the users. Plus, we can keep things as before by this new GUC --
use_cursor, in case we are losing more for some workloads. So, in short I
feel hopeful that this could be a good idea and a good time to improve
postgres_fdw.

Looking forward to your reviews, comments, etc.
--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

Robert Haas

robertmhaas@gmail.com

over 1 year ago

In reply to: Rafia Sabih (#1)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

On Mon, Jan 6, 2025 at 3:52 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:

Now, to overcome this limitation, I have worked on this idea (suggested by my colleague Bernd Helmle) of bypassing the cursors. The way it works is as follows,
there is a new GUC introduced postgres_fdw.use_cursor, which when unset uses the mode without the cursor. Now, it uses PQsetChunkedRowsMode in create_cursor when non-cursor mode is used. The size of the chunk is the same as the fetch_size. Now in fetch_more_data, when non-cursor mode is used, pgfdw_get_next_result is used to get the chunk in PGresult and processed in the same manner as before.

Now, the issue comes when there are simultaneous queries, which is the case with the join queries where all the tables involved in the join are at the local server. Because in that case we have multiple cursors opened at the same time and without a cursor mechanism we do not have any information or any other structure to know what to fetch from which query. To handle that case, we have a flag only_query, which is unset as soon as we have assigned the cursor_number >= 2, in postgresBeginForeignScan. Now, in fetch_more data, when we find out that only_query is unset, then we fetch all the data for the query and store it in a Tuplestore. These tuples are then transferred to the fsstate->tuples and then processed as usual.

So yes there is a performance drawback in the case of simultaneous queries, however, the ability to use parallel plans is really an added advantage for the users. Plus, we can keep things as before by this new GUC -- use_cursor, in case we are losing more for some workloads. So, in short I feel hopeful that this could be a good idea and a good time to improve postgres_fdw.

Hi,

I think it might have been nice to credit me in this post, since I
made some relevant suggestions here off-list, in particular the idea
of using a Tuplestore when there are multiple queries running. But I
don't think this patch quite implements what I suggested. Here, you
have a flag only_query which gets set to true at some point in time
and thereafter remains true for the lifetime of a session. That means,
I think, that all future queries will use the tuplestore even though
there might not be multiple queries running any more. which doesn't
seem like what we want. And, actually, this looks like it will be set
as soon as you reach the second query in the same transaction, even if
the two queries don't overlap. I think what you want to do is test
whether, at the point where we would need to issue a new query,
whether an existing query is already running. If not, move that
query's remaining results into a Tuplestore so you can issue the new
query.

I'm not sure what the best way to implement that is, exactly. Perhaps
fsstate->conn_state needs to store some more details about the
connection, but that's just a guess. I don't think a global variable
is what you want. Not only is that session-lifetime, but it applies
globally to every connection to every server. You want to test
something that is specific to one connection to one server, so it
needs to be part of a data structure that is scoped that way.

I think you'll want to figure out a good way to test this patch. I
don't know if we need or can reasonably have automated test cases for
this new functionality, but you at least want to have a good way to do
manual testing, so that you can show that the tuplestore is used in
cases where it's necessary and not otherwise. I'm not yet sure whether
this patch needs automated test cases or whether they can reasonably
be written, but you at least want to have a good procedure for manual
validation so that you can verify that the Tuplestore is used in all
the cases where it needs to be and, hopefully, no others.

--
Robert Haas
EDB: http://www.enterprisedb.com

Rafia Sabih

rafia.pghackers@gmail.com

over 1 year ago

In reply to: Robert Haas (#2)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

On Tue, 14 Jan 2025 at 18:33, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 6, 2025 at 3:52 AM Rafia Sabih <rafia.pghackers@gmail.com>
wrote:

Now, to overcome this limitation, I have worked on this idea (suggested

by my colleague Bernd Helmle) of bypassing the cursors. The way it works is
as follows,

there is a new GUC introduced postgres_fdw.use_cursor, which when unset

uses the mode without the cursor. Now, it uses PQsetChunkedRowsMode in
create_cursor when non-cursor mode is used. The size of the chunk is the
same as the fetch_size. Now in fetch_more_data, when non-cursor mode is
used, pgfdw_get_next_result is used to get the chunk in PGresult and
processed in the same manner as before.

Now, the issue comes when there are simultaneous queries, which is the

case with the join queries where all the tables involved in the join are at
the local server. Because in that case we have multiple cursors opened at
the same time and without a cursor mechanism we do not have any information
or any other structure to know what to fetch from which query. To handle
that case, we have a flag only_query, which is unset as soon as we have
assigned the cursor_number >= 2, in postgresBeginForeignScan. Now, in
fetch_more data, when we find out that only_query is unset, then we fetch
all the data for the query and store it in a Tuplestore. These tuples are
then transferred to the fsstate->tuples and then processed as usual.

So yes there is a performance drawback in the case of simultaneous

queries, however, the ability to use parallel plans is really an added
advantage for the users. Plus, we can keep things as before by this new GUC
-- use_cursor, in case we are losing more for some workloads. So, in short
I feel hopeful that this could be a good idea and a good time to improve
postgres_fdw.

Hi,

I think it might have been nice to credit me in this post, since I
made some relevant suggestions here off-list, in particular the idea
of using a Tuplestore when there are multiple queries running. But I
don't think this patch quite implements what I suggested. Here, you
have a flag only_query which gets set to true at some point in time
and thereafter remains true for the lifetime of a session. That means,
I think, that all future queries will use the tuplestore even though
there might not be multiple queries running any more. which doesn't
seem like what we want. And, actually, this looks like it will be set
as soon as you reach the second query in the same transaction, even if
the two queries don't overlap. I think what you want to do is test
whether, at the point where we would need to issue a new query,
whether an existing query is already running. If not, move that
query's remaining results into a Tuplestore so you can issue the new
query.

I'm not sure what the best way to implement that is, exactly. Perhaps
fsstate->conn_state needs to store some more details about the
connection, but that's just a guess. I don't think a global variable
is what you want. Not only is that session-lifetime, but it applies
globally to every connection to every server. You want to test
something that is specific to one connection to one server, so it
needs to be part of a data structure that is scoped that way.

I think you'll want to figure out a good way to test this patch. I
don't know if we need or can reasonably have automated test cases for
this new functionality, but you at least want to have a good way to do
manual testing, so that you can show that the tuplestore is used in
cases where it's necessary and not otherwise. I'm not yet sure whether
this patch needs automated test cases or whether they can reasonably
be written, but you at least want to have a good procedure for manual
validation so that you can verify that the Tuplestore is used in all
the cases where it needs to be and, hopefully, no others.

--
Robert Haas
EDB: http://www.enterprisedb.com

Indeed you are right.
Firstly, accept my apologies for not mentioning you in credits for this
work. Thanks a lot for your efforts, discussions with you were helpful in
shaping this patch and bringing it to this level.

Next, yes the last version was using tuplestore for queries within the same
transaction after the second query. To overcome this, I came across this
method to identify if there is any other simultaneous query running with
the current query; now there is an int variable num_queries which is
incremented at every call of postgresBeginForeignScan and decremented at
every call of postgresEndForeignScan. This way, if there are simultaneous
queries running we get the value of num_queries greater than 1. Now, we
check the value of num_queries and use tuplestore only when num_queries is
greater than 1. So, basically the understanding here is that if
postgresBeginForeignScan is called and before the call of
postgresEndForeignScan if another call to postgresBeginForeignScan is made,
then these are simultaneous queries.

I couldn't really find any automated method of testing this, but did it
manually by debugging and/or printing log statements in
postgresBeginForeingScan, postgresEndForeignScan, and fetch_more_data to
confirm indeed there are simultaneous queries, and only they are using
tuplestore. So, the case of simultaneous queries I found was the join
query. Because, there it creates the cursor for one side of the join and
retrieves the first tuples for it and then creates the next cursor for the
other side of join and keeps on reading all the tuples for that query and
then it comes back to first cursor and retrieves all the tuples for that
one. Similarly, it works for the queries with n number of tables in join,
basically what I found is if there are n tables in the join there will be n
open cursors at a time and then they will be closed one by one in the
descending order of the cursor_number. I will think more on the topic of
testing this and will try to come up with a script (in the best case) to
confirm the use of tuplestore in required cases only, or atleast with a set
of steps to do so.

For the regular testing of this feature, I think a regression test with
this new GUC postgres_fdw.use_cursor set to false and running all the
existing tests of postgres_fdw should suffice. What do you think? However,
at the moment when non-cursor mode is used, regression tests are failing.
Some queries require order by because order is changed in non-cursor mode,
but some require more complex changes, I am working on them.

In this version of the patch I have added only the changes mentioned above
and not the regression test modification.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

Robert Haas

robertmhaas@gmail.com

over 1 year ago

In reply to: Rafia Sabih (#3)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

On Fri, Jan 17, 2025 at 7:03 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:

Indeed you are right.
Firstly, accept my apologies for not mentioning you in credits for this work. Thanks a lot for your efforts, discussions with you were helpful in shaping this patch and bringing it to this level.

Next, yes the last version was using tuplestore for queries within the same transaction after the second query. To overcome this, I came across this method to identify if there is any other simultaneous query running with the current query; now there is an int variable num_queries which is incremented at every call of postgresBeginForeignScan and decremented at every call of postgresEndForeignScan. This way, if there are simultaneous queries running we get the value of num_queries greater than 1. Now, we check the value of num_queries and use tuplestore only when num_queries is greater than 1. So, basically the understanding here is that if postgresBeginForeignScan is called and before the call of postgresEndForeignScan if another call to postgresBeginForeignScan is made, then these are simultaneous queries.

This wouldn't be true in case of error, I believe.

--
Robert Haas
EDB: http://www.enterprisedb.com

KENAN YILMAZ

kenan.yilmaz@localus.com.tr

over 1 year ago

In reply to: Rafia Sabih (#3)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

Hi Rafia,

Based on our previous private discussion, thanks for the update and for
clarifying the current state of the patch.
I understand that more substantial changes are on the way, so I’ll focus on
relevant test scenarios rather than performance testing at this stage.

I will proceed with expanding the scenarios, including:

Multiple postgres_fdw servers are active at the same time
Multiple connections from the same postgres_fdw are active concurrently
Multiple transactions run simultaneously on a single connection
Multiple sessions operate from a single active connection

I will submit the results of these tests to this mail thread so that they
can benefit the broader community as well. Additionally, once you publish
the updated version of the patch, I will rerun the tests with the latest
changes and share the updated results.

Best Regards,

Rafia Sabih <rafia.pghackers@gmail.com>, 17 Şub 2025 Pzt, 16:46 tarihinde
şunu yazdı:

Show quoted text

On Tue, 14 Jan 2025 at 18:33, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 6, 2025 at 3:52 AM Rafia Sabih <rafia.pghackers@gmail.com>
wrote:

Now, to overcome this limitation, I have worked on this idea (suggested

by my colleague Bernd Helmle) of bypassing the cursors. The way it works is
as follows,

there is a new GUC introduced postgres_fdw.use_cursor, which when unset

uses the mode without the cursor. Now, it uses PQsetChunkedRowsMode in
create_cursor when non-cursor mode is used. The size of the chunk is the
same as the fetch_size. Now in fetch_more_data, when non-cursor mode is
used, pgfdw_get_next_result is used to get the chunk in PGresult and
processed in the same manner as before.

Now, the issue comes when there are simultaneous queries, which is the

case with the join queries where all the tables involved in the join are at
the local server. Because in that case we have multiple cursors opened at
the same time and without a cursor mechanism we do not have any information
or any other structure to know what to fetch from which query. To handle
that case, we have a flag only_query, which is unset as soon as we have
assigned the cursor_number >= 2, in postgresBeginForeignScan. Now, in
fetch_more data, when we find out that only_query is unset, then we fetch
all the data for the query and store it in a Tuplestore. These tuples are
then transferred to the fsstate->tuples and then processed as usual.

So yes there is a performance drawback in the case of simultaneous

queries, however, the ability to use parallel plans is really an added
advantage for the users. Plus, we can keep things as before by this new GUC
-- use_cursor, in case we are losing more for some workloads. So, in short
I feel hopeful that this could be a good idea and a good time to improve
postgres_fdw.

Hi,

I think it might have been nice to credit me in this post, since I
made some relevant suggestions here off-list, in particular the idea
of using a Tuplestore when there are multiple queries running. But I
don't think this patch quite implements what I suggested. Here, you
have a flag only_query which gets set to true at some point in time
and thereafter remains true for the lifetime of a session. That means,
I think, that all future queries will use the tuplestore even though
there might not be multiple queries running any more. which doesn't
seem like what we want. And, actually, this looks like it will be set
as soon as you reach the second query in the same transaction, even if
the two queries don't overlap. I think what you want to do is test
whether, at the point where we would need to issue a new query,
whether an existing query is already running. If not, move that
query's remaining results into a Tuplestore so you can issue the new
query.

I'm not sure what the best way to implement that is, exactly. Perhaps
fsstate->conn_state needs to store some more details about the
connection, but that's just a guess. I don't think a global variable
is what you want. Not only is that session-lifetime, but it applies
globally to every connection to every server. You want to test
something that is specific to one connection to one server, so it
needs to be part of a data structure that is scoped that way.

I think you'll want to figure out a good way to test this patch. I
don't know if we need or can reasonably have automated test cases for
this new functionality, but you at least want to have a good way to do
manual testing, so that you can show that the tuplestore is used in
cases where it's necessary and not otherwise. I'm not yet sure whether
this patch needs automated test cases or whether they can reasonably
be written, but you at least want to have a good procedure for manual
validation so that you can verify that the Tuplestore is used in all
the cases where it needs to be and, hopefully, no others.

--
Robert Haas
EDB: http://www.enterprisedb.com

Indeed you are right.
Firstly, accept my apologies for not mentioning you in credits for this
work. Thanks a lot for your efforts, discussions with you were helpful in
shaping this patch and bringing it to this level.

Next, yes the last version was using tuplestore for queries within the
same transaction after the second query. To overcome this, I came across
this method to identify if there is any other simultaneous query running
with the current query; now there is an int variable num_queries which is
incremented at every call of postgresBeginForeignScan and decremented at
every call of postgresEndForeignScan. This way, if there are simultaneous
queries running we get the value of num_queries greater than 1. Now, we
check the value of num_queries and use tuplestore only when num_queries is
greater than 1. So, basically the understanding here is that if
postgresBeginForeignScan is called and before the call of
postgresEndForeignScan if another call to postgresBeginForeignScan is made,
then these are simultaneous queries.

I couldn't really find any automated method of testing this, but did it
manually by debugging and/or printing log statements in
postgresBeginForeingScan, postgresEndForeignScan, and fetch_more_data to
confirm indeed there are simultaneous queries, and only they are using
tuplestore. So, the case of simultaneous queries I found was the join
query. Because, there it creates the cursor for one side of the join and
retrieves the first tuples for it and then creates the next cursor for the
other side of join and keeps on reading all the tuples for that query and
then it comes back to first cursor and retrieves all the tuples for that
one. Similarly, it works for the queries with n number of tables in join,
basically what I found is if there are n tables in the join there will be n
open cursors at a time and then they will be closed one by one in the
descending order of the cursor_number. I will think more on the topic of
testing this and will try to come up with a script (in the best case) to
confirm the use of tuplestore in required cases only, or atleast with a set
of steps to do so.

For the regular testing of this feature, I think a regression test with
this new GUC postgres_fdw.use_cursor set to false and running all the
existing tests of postgres_fdw should suffice. What do you think? However,
at the moment when non-cursor mode is used, regression tests are failing.
Some queries require order by because order is changed in non-cursor mode,
but some require more complex changes, I am working on them.

In this version of the patch I have added only the changes mentioned above
and not the regression test modification.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

KENAN YILMAZ

kenan.yilmaz@localus.com.tr

over 1 year ago

In reply to: KENAN YILMAZ (#5)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

Hi Rafia,
Sorry for the late response. I have completed my tests, and parallel
workers were successfully launched on the remote server. Below are the
details of my setup and test results.

# Machine Details
CPU: 4 cores
Memory: 8GB
PostgreSQL Version: 17.2 (compiled from source)
OS: Rocky Linux 8.10
Two VM instances

# PostgreSQL Configuration (For Demonstration)

logging_collector = on
log_truncate_on_rotation = on
log_rotation_size = 1GB
log_filename = 'postgresql-%a.log'
log_line_prefix = '%t [%p]: %q bg=%b, db=%d, usr=%u, client=%h, qryId=%Q,
txId=%x, app=%a, line=%l'
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_maintenance_workers = 2
max_parallel_workers = 4
debug_print_plan = on
track_functions = 'all'
log_statement = 'all'
postgres_fdw.use_cursor = false
shared_preload_libraries = 'pg_stat_statements,auto_explain'
compute_query_id = on
min_parallel_table_scan_size = 0
parallel_tuple_cost = 0
parallel_setup_cost = 0
auto_explain.log_min_duration = '0.00ms'
auto_explain.log_analyze = 'on'

# Memory Settings
effective_cache_size = '4GB'
shared_buffers = '1638MB'
work_mem = '100MB'

# Test Setup
# Creating pgbench Tables

$ pgbench -i -s 50 testdb

Running SQL Tests on Local Machine (192.168.1.68)

psql> CREATE EXTENSION postgres_fdw;
psql> CREATE SERVER fdwtest FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host
'192.168.1.69', dbname 'testdb');
psql> CREATE USER MAPPING FOR postgres SERVER fdwtest OPTIONS (user
'postgres');
psql> CREATE SCHEMA fdwtest;
psql> IMPORT FOREIGN SCHEMA public FROM SERVER fdwtest INTO fdwtest;

# Query 1: Counting Rows in a Foreign Table
testdb=# explain analyze select count(*) from fdwtest.pgbench_accounts
where aid> 1000;
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan (cost=102.84..155.73 rows=1 width=8) (actual
time=457.965..457.967 rows=1 loops=1)
Relations: Aggregate on (pgbench_accounts)
Planning Time: 0.661 ms
Execution Time: 458.802 ms
(4 rows)

Time: 461.944 ms

# Query 2: Fetching a Single Row from Foreign Table
testdb=# explain analyze select aid from fdwtest.pgbench_accounts where
aid> 1000 limit 1;
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan on pgbench_accounts (cost=100.00..100.24 rows=1 width=4)
(actual time=9.335..9.336 rows=1 loops=1)
Planning Time: 0.452 ms
Execution Time: 10.304 ms
(3 rows)

Time: 12.605 ms

## PostgreSQL Logs
## Local Machine Logs (192.168.1.68) – Cropped

bg=client backend, db=testdb, usr=postgres, client=[local] ,
qryId=-6000156370405137929 , txId=0, app=psql, line=14 LOG: duration:
457.971 ms plan:
Query Text: explain analyze select count(*) from
fdwtest.pgbench_accounts where aid> 1000;
Foreign Scan (cost=102.84..155.73 rows=1 width=8) (actual
time=457.965..457.967 rows=1 loops=1)
Relations: Aggregate on (pgbench_accounts)
..
STATEMENT: explain analyze select aid from fdwtest.pgbench_accounts where
aid> 1000 limit 1;
bg=client backend, db=testdb, usr=postgres, client=[local] ,
qryId=5870070636604972000 , txId=0, app=psql, line=19 LOG: duration: 9.339
ms plan:
Query Text: explain analyze select aid from
fdwtest.pgbench_accounts where aid> 1000 limit 1;
Foreign Scan on pgbench_accounts (cost=100.00..100.24 rows=1
width=4) (actual time=9.335..9.336 rows=1 loops=1)

## Remote Machine Logs (192.168.1.69) – Cropped
STATEMENT: SELECT count(*) FROM public.pgbench_accounts WHERE ((aid >
1000))
bg=client backend, db=testdb, usr=postgres, client=192.168.1.68 ,
qryId=-7176633966431489392 , txId=0, app=postgres_fdw, line=22 LOG:
execute <unnamed>: SELECT count(*) FROM public.pgbench_accounts WHERE
((aid > 1000))
bg=client backend, db=testdb, usr=postgres, client=192.168.1.68 , qryId=0 ,
txId=0, app=postgres_fdw, line=23 LOG: statement: COMMIT TRANSACTION
bg=client backend, db=testdb, usr=postgres, client=192.168.1.68 ,
qryId=-2835399305386018931 , txId=0, app=postgres_fdw, line=24 LOG:
duration: 455.461 ms plan:
Query Text: SELECT count(*) FROM public.pgbench_accounts WHERE
((aid > 1000))
Finalize Aggregate (cost=113216.98..113216.99 rows=1 width=8)
(actual time=454.321..455.452 rows=1 loops=1)
-> Gather (cost=113216.97..113216.98 rows=2 width=8) (actual
time=454.173..455.443 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=113216.97..113216.98 rows=1
width=8) (actual time=449.393..449.394 rows=1 loops=3)
-> Parallel Seq Scan on pgbench_accounts
(cost=0.00..108009.67 rows=2082920 width=0) (actual time=0.255..343.378
rows=1666333 loops=3)
Filter: (aid > 1000)
Rows Removed by Filter: 333
bg=client backend, db=testdb, usr=postgres, client=192.168.1.68 , qryId=0 ,
txId=0, app=postgres_fdw, line=25 LOG: statement: START TRANSACTION
ISOLATION LEVEL REPEATABLE READ
..
STATEMENT: SELECT aid FROM public.pgbench_accounts WHERE ((aid > 1000))
LIMIT 1::bigint
bg=client backend, db=testdb, usr=postgres, client=192.168.1.68 ,
qryId=5994602644362067232 , txId=0, app=postgres_fdw, line=29 LOG: execute
<unnamed>: SELECT aid FROM public.pgbench_accounts WHERE ((aid > 1000))
LIMIT 1::bigint
bg=client backend, db=testdb, usr=postgres, client=192.168.1.68 , qryId=0 ,
txId=0, app=postgres_fdw, line=30 LOG: statement: COMMIT TRANSACTION
bg=client backend, db=testdb, usr=postgres, client=192.168.1.68 ,
qryId=-2835399305386018931 , txId=0, app=postgres_fdw, line=31 LOG:
duration: 7.983 ms plan:
Query Text: SELECT aid FROM public.pgbench_accounts WHERE ((aid >
1000)) LIMIT 1::bigint
Limit (cost=0.00..0.02 rows=1 width=4) (actual time=0.836..7.974
rows=1 loops=1)
-> Gather (cost=0.00..108009.67 rows=4999007 width=4) (actual
time=0.834..7.972 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on pgbench_accounts
(cost=0.00..108009.67 rows=2082920 width=4) (actual time=0.270..0.271
rows=1 loops=3)
Filter: (aid > 1000)
Rows Removed by Filter: 333

if you would like me to conduct more complex tests, feel free to let me
know.
Best regards,
Kenan YILMAZ

KENAN YILMAZ <kenan.yilmaz@localus.com.tr>, 17 Şub 2025 Pzt, 17:09
tarihinde şunu yazdı:

Show quoted text

Hi Rafia,

Based on our previous private discussion, thanks for the update and for
clarifying the current state of the patch.
I understand that more substantial changes are on the way, so I’ll focus
on relevant test scenarios rather than performance testing at this stage.

I will proceed with expanding the scenarios, including:

Multiple postgres_fdw servers are active at the same time
Multiple connections from the same postgres_fdw are active concurrently
Multiple transactions run simultaneously on a single connection
Multiple sessions operate from a single active connection

I will submit the results of these tests to this mail thread so that they
can benefit the broader community as well. Additionally, once you publish
the updated version of the patch, I will rerun the tests with the latest
changes and share the updated results.

Best Regards,

Rafia Sabih <rafia.pghackers@gmail.com>, 17 Şub 2025 Pzt, 16:46 tarihinde
şunu yazdı:

On Tue, 14 Jan 2025 at 18:33, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 6, 2025 at 3:52 AM Rafia Sabih <rafia.pghackers@gmail.com>
wrote:

Now, to overcome this limitation, I have worked on this idea

(suggested by my colleague Bernd Helmle) of bypassing the cursors. The way
it works is as follows,

there is a new GUC introduced postgres_fdw.use_cursor, which when

unset uses the mode without the cursor. Now, it uses PQsetChunkedRowsMode
in create_cursor when non-cursor mode is used. The size of the chunk is the
same as the fetch_size. Now in fetch_more_data, when non-cursor mode is
used, pgfdw_get_next_result is used to get the chunk in PGresult and
processed in the same manner as before.

Now, the issue comes when there are simultaneous queries, which is the

case with the join queries where all the tables involved in the join are at
the local server. Because in that case we have multiple cursors opened at
the same time and without a cursor mechanism we do not have any information
or any other structure to know what to fetch from which query. To handle
that case, we have a flag only_query, which is unset as soon as we have
assigned the cursor_number >= 2, in postgresBeginForeignScan. Now, in
fetch_more data, when we find out that only_query is unset, then we fetch
all the data for the query and store it in a Tuplestore. These tuples are
then transferred to the fsstate->tuples and then processed as usual.

So yes there is a performance drawback in the case of simultaneous

queries, however, the ability to use parallel plans is really an added
advantage for the users. Plus, we can keep things as before by this new GUC
-- use_cursor, in case we are losing more for some workloads. So, in short
I feel hopeful that this could be a good idea and a good time to improve
postgres_fdw.

Hi,

I think it might have been nice to credit me in this post, since I
made some relevant suggestions here off-list, in particular the idea
of using a Tuplestore when there are multiple queries running. But I
don't think this patch quite implements what I suggested. Here, you
have a flag only_query which gets set to true at some point in time
and thereafter remains true for the lifetime of a session. That means,
I think, that all future queries will use the tuplestore even though
there might not be multiple queries running any more. which doesn't
seem like what we want. And, actually, this looks like it will be set
as soon as you reach the second query in the same transaction, even if
the two queries don't overlap. I think what you want to do is test
whether, at the point where we would need to issue a new query,
whether an existing query is already running. If not, move that
query's remaining results into a Tuplestore so you can issue the new
query.

I'm not sure what the best way to implement that is, exactly. Perhaps
fsstate->conn_state needs to store some more details about the
connection, but that's just a guess. I don't think a global variable
is what you want. Not only is that session-lifetime, but it applies
globally to every connection to every server. You want to test
something that is specific to one connection to one server, so it
needs to be part of a data structure that is scoped that way.

I think you'll want to figure out a good way to test this patch. I
don't know if we need or can reasonably have automated test cases for
this new functionality, but you at least want to have a good way to do
manual testing, so that you can show that the tuplestore is used in
cases where it's necessary and not otherwise. I'm not yet sure whether
this patch needs automated test cases or whether they can reasonably
be written, but you at least want to have a good procedure for manual
validation so that you can verify that the Tuplestore is used in all
the cases where it needs to be and, hopefully, no others.

--
Robert Haas
EDB: http://www.enterprisedb.com

Indeed you are right.
Firstly, accept my apologies for not mentioning you in credits for this
work. Thanks a lot for your efforts, discussions with you were helpful in
shaping this patch and bringing it to this level.

Next, yes the last version was using tuplestore for queries within the
same transaction after the second query. To overcome this, I came across
this method to identify if there is any other simultaneous query running
with the current query; now there is an int variable num_queries which is
incremented at every call of postgresBeginForeignScan and decremented at
every call of postgresEndForeignScan. This way, if there are simultaneous
queries running we get the value of num_queries greater than 1. Now, we
check the value of num_queries and use tuplestore only when num_queries is
greater than 1. So, basically the understanding here is that if
postgresBeginForeignScan is called and before the call of
postgresEndForeignScan if another call to postgresBeginForeignScan is made,
then these are simultaneous queries.

I couldn't really find any automated method of testing this, but did it
manually by debugging and/or printing log statements in
postgresBeginForeingScan, postgresEndForeignScan, and fetch_more_data to
confirm indeed there are simultaneous queries, and only they are using
tuplestore. So, the case of simultaneous queries I found was the join
query. Because, there it creates the cursor for one side of the join and
retrieves the first tuples for it and then creates the next cursor for the
other side of join and keeps on reading all the tuples for that query and
then it comes back to first cursor and retrieves all the tuples for that
one. Similarly, it works for the queries with n number of tables in join,
basically what I found is if there are n tables in the join there will be n
open cursors at a time and then they will be closed one by one in the
descending order of the cursor_number. I will think more on the topic of
testing this and will try to come up with a script (in the best case) to
confirm the use of tuplestore in required cases only, or atleast with a set
of steps to do so.

For the regular testing of this feature, I think a regression test with
this new GUC postgres_fdw.use_cursor set to false and running all the
existing tests of postgres_fdw should suffice. What do you think? However,
at the moment when non-cursor mode is used, regression tests are failing.
Some queries require order by because order is changed in non-cursor mode,
but some require more complex changes, I am working on them.

In this version of the patch I have added only the changes mentioned
above and not the regression test modification.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

Andy Fan

zhihui.fan1213@gmail.com

over 1 year ago

In reply to: Rafia Sabih (#1)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

Rafia Sabih <rafia.pghackers@gmail.com> writes:

Hi,

At present, in postgres_fdw, if a query which is using a parallel plan is fired from the remote end fails to use the
parallel plan locally because of the presence of CURSORS. Consider the following example,

...

Now, to overcome this limitation, I have worked on this idea (suggested by my colleague Bernd Helmle) of bypassing the
cursors.

Do you know why we can't use parallel plan when cursor is used? Is It
related to this code in ExecutePlan?

/*
* Set up parallel mode if appropriate.
*
* Parallel mode only supports complete execution of a plan. If we've
* already partially executed it, or if the caller asks us to exit early,
* we must force the plan to run without parallelism.
*/
if (queryDesc->already_executed || numberTuples != 0)
use_parallel_mode = false;

Actually I can't understand the comment as well and I had this
confusion for a long time.

--
Best Regards
Andy Fan

Rafia Sabih

rafia.pghackers@gmail.com

10 months ago

In reply to: Andy Fan (#7)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

Hello hackers,

I am back at this work with a rebased and revised patch. The new version is
rebased and has a change in approach.
Whenever we are using non-cursor mode, for the first cursor we are always
saving the tuples
in the tuplestore, this is because we do not have any means to know
beforehand how many cursors are required for the query.
And when we switch to the next query then we do not have a way to fetch the
tuples for the previous query.
So, the tuples retrieved earlier for the first query were lost if not saved.
I would highly appreciate your time and feedback for this.

On Wed, 12 Mar 2025 at 12:57, Andy Fan <zhihuifan1213@163.com> wrote:

Rafia Sabih <rafia.pghackers@gmail.com> writes:

Hi,

At present, in postgres_fdw, if a query which is using a parallel plan

is fired from the remote end fails to use the

parallel plan locally because of the presence of CURSORS. Consider the

following example,
...

Now, to overcome this limitation, I have worked on this idea (suggested

by my colleague Bernd Helmle) of bypassing the

cursors.

Do you know why we can't use parallel plan when cursor is used? Is It
related to this code in ExecutePlan?

/*
* Set up parallel mode if appropriate.
*
* Parallel mode only supports complete execution of a plan. If
we've
* already partially executed it, or if the caller asks us to exit
early,
* we must force the plan to run without parallelism.
*/
if (queryDesc->already_executed || numberTuples != 0)
use_parallel_mode = false;

Actually I can't understand the comment as well and I had this
confusion for a long time.

--
Best Regards
Andy Fan

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

Robert Haas

robertmhaas@gmail.com

9 months ago

In reply to: Andy Fan (#7)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

On Wed, Mar 12, 2025 at 7:57 AM Andy Fan <zhihuifan1213@163.com> wrote:

Do you know why we can't use parallel plan when cursor is used? Is It
related to this code in ExecutePlan?

Yes. When a cursor is used, the whole query isn't executed all at
once, but rather the executor will be started and stopped for each
fetch from the cursor. We can't keep the parallel workers running for
that whole time, not just because it would be inefficient, but because
it would be incorrect. State changes would be possible in the leader
that were not reflected in the workers, leading to chaos.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10

Robert Haas

robertmhaas@gmail.com

9 months ago

In reply to: Rafia Sabih (#8)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:

I am back at this work with a rebased and revised patch. The new version is rebased and has a change in approach.
Whenever we are using non-cursor mode, for the first cursor we are always saving the tuples
in the tuplestore, this is because we do not have any means to know beforehand how many cursors are required for the query.

This might have the advantage of being simpler, but it's definitely
worse. If we're only fetching one result set, which will be common,
we'll buffer the whole thing in a tuplestore where that could
otherwise be avoided. Maybe it's still best to go with this approach;
not sure.

And when we switch to the next query then we do not have a way to fetch the tuples for the previous query.
So, the tuples retrieved earlier for the first query were lost if not saved.
I would highly appreciate your time and feedback for this.

My suggestions are to work on the following areas:

1. Automated testing. The patch has no regression tests, and won't get
committed without those.

2. Manual testing. How does the performance with this new option
compare to the existing method? The answer might be different for
small result sets that fit in memory and large ones that spill to
disk, and will certainly also depend on how productive parallel query
can be on the remote side; but I think you want to do and post on this
list some measurements showing the best and worst case for the patch.

3. Patch clean-up. There are plenty of typos and whitespace-only
changes in the patch. It's best to clean those up. Running pgindent is
a good idea too. Some places could use comments.

--
Robert Haas
EDB: http://www.enterprisedb.com

#11

Rafia Sabih

rafia.pghackers@gmail.com

8 months ago

In reply to: Robert Haas (#10)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

On Fri, 10 Oct 2025 at 22:02, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia.pghackers@gmail.com>

wrote:

I am back at this work with a rebased and revised patch. The new

version is rebased and has a change in approach.

Whenever we are using non-cursor mode, for the first cursor we are

always saving the tuples

in the tuplestore, this is because we do not have any means to know

beforehand how many cursors are required for the query.

This might have the advantage of being simpler, but it's definitely
worse. If we're only fetching one result set, which will be common,
we'll buffer the whole thing in a tuplestore where that could
otherwise be avoided. Maybe it's still best to go with this approach;
not sure.

And when we switch to the next query then we do not have a way to fetch

the tuples for the previous query.

So, the tuples retrieved earlier for the first query were lost if not

saved.

I would highly appreciate your time and feedback for this.

My suggestions are to work on the following areas:

1. Automated testing. The patch has no regression tests, and won't get
committed without those.

2. Manual testing. How does the performance with this new option
compare to the existing method? The answer might be different for
small result sets that fit in memory and large ones that spill to
disk, and will certainly also depend on how productive parallel query
can be on the remote side; but I think you want to do and post on this
list some measurements showing the best and worst case for the patch.

3. Patch clean-up. There are plenty of typos and whitespace-only
changes in the patch. It's best to clean those up. Running pgindent is
a good idea too. Some places could use comments.

--
Robert Haas
EDB: http://www.enterprisedb.com

Thank you Robert, for reviewing this patch. On going through the patch
more, I realised this was not equipped to handle the cases when there are
more than two active cursors. So to accommodate such a case, I now modified
the new struct for saving the previous query to a list of such structs.
Also, it turns out we need not to save the tuples in case this is an active
cursor, so we only populate the associated tuplestore only when we need to
create a new cursor when the old cursor is not completely done.

In this version I have added the regression tests as well. I wanted to test
this patch for all the cases of postgres_fdw, the only way I could figure
out how to do this was to test the select statements with the new GUC.

I also did some tests for performance. I used the contrib_regression
database and populated the table "S1"."T1" with more tuples to understand
the impact of patch on higher scale. I also used auto_explain to get the
foreign plans.

contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:35.825 CET [44338] LOG: duration: 61.336 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
2025-11-14 14:40:35.825 CET [44862] LOG: duration: 60.575 ms plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT count(*) FROM "S 1"."T 1"
Aggregate (cost=21888.22..21888.23 rows=1 width=8)
-> Seq Scan on "T 1" (cost=0.00..19956.98 rows=772498 width=0)
count
--------
990821
(1 row)

Time: 62.728 ms
contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
Time: 1.515 ms
contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:46.260 CET [44862] LOG: duration: 21.875 ms plan:
Query Text: SELECT count(*) FROM "S 1"."T 1"
Finalize Aggregate (cost=17255.64..17255.65 rows=1 width=8)
-> Gather (cost=17255.43..17255.64 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=16255.43..16255.44 rows=1 width=8)
-> Parallel Seq Scan on "T 1" (cost=0.00..15450.74
rows=321874 width=0)
2025-11-14 14:40:46.260 CET [44338] LOG: duration: 22.623 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
count
--------
990821
(1 row)

Time: 24.862 ms

So for this query, the advantage is coming from parallel query which was
otherwise not possible in this scenario.
To study more performance with this patch, I found another interesting
query and here are the results,

contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# explain (analyse, buffers) SELECT t1."C 1" FROM "S
1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 =
t1."C 1");2025-11-14 15:57:46.893 CET [1946]
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual
time=112311.578..112804.516 rows=990821.00 loops=1)
Hash Cond: (t1."C 1" = t3.c1)
Buffers: shared hit=12232, temp read=12754 written=12754
-> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4)
(actual time=0.039..48.808 rows=990821.00 loops=1)
Buffers: shared hit=12232
-> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual
time=112310.386..112310.387 rows=990821.00 loops=1)
Buckets: 262144 (originally 1024) Batches: 8 (originally 1)
Memory Usage: 6408kB
Buffers: temp written=2537
-> Nested Loop (cost=200.43..109001.00 rows=1000 width=4)
(actual time=0.728..112030.241 rows=990821.00 loops=1)
-> Foreign Scan on ft1 t2 (cost=100.00..331.00 rows=1000
width=4) (actual time=0.398..710.505 rows=990821.00 loops=1)
-> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1
width=4) (actual time=0.082..0.082 rows=1.00 loops=990821)
Planning:
Buffers: shared hit=5
Planning Time: 2.211 ms
Execution Time: 112825.428 ms
(15 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
explain (analyse, buffers) SELECT t1."C 1" FROM "S 1"."T 1" t1 left join
ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual
time=261.416..354.520 rows=990821.00 loops=1)
Hash Cond: (t1."C 1" = t3.c1)
Buffers: shared hit=12232, temp written=2660
-> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4) (actual
time=0.021..35.531 rows=990821.00 loops=1)
Buffers: shared hit=12232
-> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual
time=261.381..261.383 rows=100.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
Buffers: temp written=2660
-> Nested Loop (cost=200.43..109001.00 rows=1000 width=4) (actual
time=255.563..261.356 rows=100.00 loops=1)
Buffers: temp written=2660
-> Foreign Scan on ft1 t2 (cost=100.00..331.00 rows=1000 width=4)
(actual time=0.433..0.443 rows=100.00 loops=1)
-> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1 width=4)
(actual time=2.609..2.609 rows=1.00 loops=100)
Buffers: temp written=2660
Planning:
Buffers: shared hit=5
Planning Time: 2.284 ms
Execution Time: 389.358 ms
(17 rows)

So even in the case without a parallel plan, it is performing significantly
better. I investigated a bit more to find out why the query was so slow
with the cursors,
and came to understand that it is repeatedly abandoning and recreating the
cursor via the code path of postgresReScanForeignScan.
So, it looks like cursor management itself costs this much more time.

To understand the impact of the patch in case of tuples spilled to disk, I
tried following example,
contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# set enable_hashjoin = off;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b
where a.c1 = b.c1 order by a.c2;
QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=831.49..833.99 rows=1000 width=94) (actual
time=4537.437..4598.483 rows=990821.00 loops=1)
Sort Key: a.c2
Sort Method: external merge Disk: 137768kB
Buffers: temp read=83156 written=83253
-> Merge Join (cost=761.66..781.66 rows=1000 width=94) (actual
time=3748.488..4090.547 rows=990821.00 loops=1)
Merge Cond: (a.c1 = b.c1)
Buffers: temp read=48725 written=48779
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1818.521..1865.792 rows=990821.00 loops=1)
Sort Key: a.c1
Sort Method: external merge Disk: 75664kB
Buffers: temp read=18910 written=18937
-> Foreign Scan on ft1 a (cost=100.00..331.00 rows=1000
width=47) (actual time=1.302..1568.640 rows=990821.00 loops=1)
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1929.955..1981.104 rows=990821.00 loops=1)
Sort Key: b.c1
Sort Method: external sort Disk: 79520kB
Buffers: temp read=29815 written=29842
-> Foreign Scan on ft1 b (cost=100.00..331.00 rows=1000
width=47) (actual time=0.528..1553.249 rows=990821.00 loops=1)
Planning Time: 0.479 ms
Execution Time: 4661.872 ms
(19 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b
where a.c1 = b.c1 order by a.c2;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=831.49..833.99 rows=1000 width=94) (actual
time=3376.385..3435.406 rows=990821.00 loops=1)
Sort Key: a.c2
Sort Method: external merge Disk: 137768kB
Buffers: temp read=83156 written=83253
-> Merge Join (cost=761.66..781.66 rows=1000 width=94) (actual
time=2565.517..2916.814 rows=990821.00 loops=1)
Merge Cond: (a.c1 = b.c1)
Buffers: temp read=48725 written=48779
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1249.517..1300.132 rows=990821.00 loops=1)
Sort Key: a.c1
Sort Method: external merge Disk: 75664kB
Buffers: temp read=18910 written=18937
-> Foreign Scan on ft1 a (cost=100.00..331.00 rows=1000
width=47) (actual time=1.651..980.740 rows=990821.00 loops=1)
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1315.990..1369.576 rows=990821.00 loops=1)
Sort Key: b.c1
Sort Method: external sort Disk: 79520kB
Buffers: temp read=29815 written=29842
-> Foreign Scan on ft1 b (cost=100.00..331.00 rows=1000
width=47) (actual time=0.426..970.728 rows=990821.00 loops=1)
Planning Time: 0.491 ms
Execution Time: 3527.457 ms
(19 rows)

So it doesn't hurt in this case either. But that is because not the
tuplestore which is added in this patch is spilling to disk here.
I am working on finding a case when there are two or more active cursors
and then when storing the tuples of one cursor,
they are spilled onto the disk. That might give a picture of the worst case
scenario of this patch.

Also, thanks to Kenan, a fellow hacker who finds this work interesting and
offered to do some performance analysis for this patch,
maybe he can also post more results here.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

#12

KENAN YILMAZ

kenan.yilmaz@localus.com.tr

8 months ago

In reply to: Rafia Sabih (#11)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

Hello Hackers,

I have executed use_cursor = true/false with quick tests.

The test table 't1' has 20,000,000 rows (~2.8 GB in size).

The test query is `EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE
a > 1000;`

The result sums are;

** `postgres_fdw.use_cursor = true` --> 20090.782 ms
** `postgres_fdw.use_cursor = false` --> 24103.994 ms

If you wish to run more advanced and complex test queries, feel free to do
so.

All execution test scenarios are listed below;

--
-- restart with flush caches;
$ pg_ctl -D $PGDATA stop && sudo sync && echo 3 | sudo tee
/proc/sys/vm/drop_caches && rm $PGDATA/log/*.log && pg_ctl -D $PGDATA start
-- Data prep stage
DROP DATABASE IF EXISTS testdb;
CREATE DATABASE testdb;
\c testdb;
DROP TABLE IF EXISTS t1 CASCADE;
CREATE UNLOGGED TABLE t1 (a int, b int, c text, d timestamp);

-- Insert test datas
INSERT INTO t1 SELECT 10 + mod(i, 30), i, md5(i::text) ||
md5((i+1000000)::text) || md5((i+2000000)::text), '2025-01-01
10:00:00'::timestamp + (random() * 31536000) * INTERVAL '1 second' FROM
generate_series(1, 20000000) i;

-- Table maintenance
ALTER TABLE t1 SET LOGGED;
VACUUM (ANALYZE, FULL) t1;

-- FDW prep stage
DROP EXTENSION IF EXISTS postgres_fdw CASCADE;
CREATE EXTENSION postgres_fdw;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS
(host '127.0.0.1', port '5432', dbname 'testdb');
CREATE USER MAPPING FOR postgres SERVER foreign_server OPTIONS (user
'postgres');
CREATE FOREIGN TABLE t1fdw (a int, b int, c text, d timestamp) SERVER
foreign_server OPTIONS (table_name 't1');

-- restart with flush caches are applied before all stage executions.

--
-- * Local t1 table EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gather (cost=0.00..467803.85 rows=1 width=105) (actual
time=23260.306..23261.591 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=363637
I/O Timings: shared read=64590.910
-> Parallel Seq Scan on t1 (cost=0.00..467803.85 rows=1 width=105)
(actual time=23242.279..23242.280 rows=0.00 loops=3)
Filter: (a > 1000)
Rows Removed by Filter: 6666667
Buffers: shared read=363637
I/O Timings: shared read=64590.910
Planning:
Buffers: shared hit=54 read=14 dirtied=1
I/O Timings: shared read=23.281
Planning Time: 38.734 ms
Execution Time: 23269.677 ms
(15 rows)

Time: 23347.716 ms (00:23.348)

--
-- * use_cursor = true (Default) EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual
time=20074.746..20074.796 rows=0.00 loops=1)
Planning:
Buffers: shared hit=33 read=17 dirtied=1
I/O Timings: shared read=10.696
Planning Time: 43.852 ms
Execution Time: 20090.782 ms
(6 rows)

Time: 20169.081 ms (00:20.169)

--> From Log Files
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: execute
<unnamed>: DECLARE c1 CURSOR FOR
SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
exec_execute_message, postgres.c:2245
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: FETCH
100 FROM c1
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[153044]: line=6 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=psql client=[local]: LOG: 00000: duration: 20074.799 ms plan:
Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40)
(actual time=20074.746..20074.796 rows=0.00 loops=1)
[153044]: line=6 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: CLOSE
c1
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: duration:
20057.543 ms plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
Seq Scan on t1 (cost=0.00..613637.45 rows=1 width=105) (actual
time=20057.541..20057.541 rows=0.00 loops=1)
Filter: (a > 1000)
Rows Removed by Filter: 20000000
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
explain_ExecutorEnd, auto_explain.c:437
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT TRANSACTION
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT
TRANSACTION

--
-- * use_cursor = false EXPLAIN results
--

testdb=# SET postgres_fdw.use_cursor = false;

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual
time=24080.945..24080.956 rows=0.00 loops=1)
Planning:
Buffers: shared hit=33 read=17
I/O Timings: shared read=30.074
Planning Time: 53.678 ms
Execution Time: 24103.994 ms
(6 rows)

Time: 24230.548 ms (00:24.231)

--> From Log Files
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOG: 00000: execute <unnamed>: SELECT
a, b, c FROM public.t1 WHERE ((a > 1000))
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOCATION: exec_execute_message,
postgres.c:2245
[153113]: line=8 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=psql client=[local]: LOG: 00000: duration: 24080.958 ms plan:
Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40)
(actual time=24080.945..24080.956 rows=0.00 loops=1)
[153113]: line=8 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT
TRANSACTION
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOG: 00000: duration: 24059.372 ms
plan:
Query Text: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
Gather (cost=0.00..467803.85 rows=1 width=105) (actual
time=24058.076..24059.367 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on t1 (cost=0.00..467803.85 rows=1
width=105) (actual time=24051.406..24051.407 rows=0.00 loops=3)
Filter: (a > 1000)
Rows Removed by Filter: 6666667
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd,
auto_explain.c:437

---

Kenan YILMAZ

Rafia Sabih <rafia.pghackers@gmail.com>, 25 Kas 2025 Sal, 17:01 tarihinde
şunu yazdı:

Show quoted text

On Fri, 10 Oct 2025 at 22:02, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia.pghackers@gmail.com>

wrote:

I am back at this work with a rebased and revised patch. The new

version is rebased and has a change in approach.

Whenever we are using non-cursor mode, for the first cursor we are

always saving the tuples

in the tuplestore, this is because we do not have any means to know

beforehand how many cursors are required for the query.

This might have the advantage of being simpler, but it's definitely
worse. If we're only fetching one result set, which will be common,
we'll buffer the whole thing in a tuplestore where that could
otherwise be avoided. Maybe it's still best to go with this approach;
not sure.

And when we switch to the next query then we do not have a way to

fetch the tuples for the previous query.

So, the tuples retrieved earlier for the first query were lost if not

saved.

I would highly appreciate your time and feedback for this.

My suggestions are to work on the following areas:

1. Automated testing. The patch has no regression tests, and won't get
committed without those.

2. Manual testing. How does the performance with this new option
compare to the existing method? The answer might be different for
small result sets that fit in memory and large ones that spill to
disk, and will certainly also depend on how productive parallel query
can be on the remote side; but I think you want to do and post on this
list some measurements showing the best and worst case for the patch.

3. Patch clean-up. There are plenty of typos and whitespace-only
changes in the patch. It's best to clean those up. Running pgindent is
a good idea too. Some places could use comments.

--
Robert Haas
EDB: http://www.enterprisedb.com

Thank you Robert, for reviewing this patch. On going through the patch
more, I realised this was not equipped to handle the cases when there are
more than two active cursors. So to accommodate such a case, I now modified
the new struct for saving the previous query to a list of such structs.
Also, it turns out we need not to save the tuples in case this is an active
cursor, so we only populate the associated tuplestore only when we need to
create a new cursor when the old cursor is not completely done.

In this version I have added the regression tests as well. I wanted to
test this patch for all the cases of postgres_fdw, the only way I could
figure out how to do this was to test the select statements with the new
GUC.

I also did some tests for performance. I used the contrib_regression
database and populated the table "S1"."T1" with more tuples to understand
the impact of patch on higher scale. I also used auto_explain to get the
foreign plans.

contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:35.825 CET [44338] LOG: duration: 61.336 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
2025-11-14 14:40:35.825 CET [44862] LOG: duration: 60.575 ms plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT count(*) FROM "S 1"."T 1"
Aggregate (cost=21888.22..21888.23 rows=1 width=8)
-> Seq Scan on "T 1" (cost=0.00..19956.98 rows=772498 width=0)
count
--------
990821
(1 row)

Time: 62.728 ms
contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
Time: 1.515 ms
contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:46.260 CET [44862] LOG: duration: 21.875 ms plan:
Query Text: SELECT count(*) FROM "S 1"."T 1"
Finalize Aggregate (cost=17255.64..17255.65 rows=1 width=8)
-> Gather (cost=17255.43..17255.64 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=16255.43..16255.44 rows=1 width=8)
-> Parallel Seq Scan on "T 1" (cost=0.00..15450.74
rows=321874 width=0)
2025-11-14 14:40:46.260 CET [44338] LOG: duration: 22.623 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
count
--------
990821
(1 row)

Time: 24.862 ms

So for this query, the advantage is coming from parallel query which was
otherwise not possible in this scenario.
To study more performance with this patch, I found another interesting
query and here are the results,

contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# explain (analyse, buffers) SELECT t1."C 1" FROM "S
1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 =
t1."C 1");2025-11-14 15:57:46.893 CET [1946]
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual
time=112311.578..112804.516 rows=990821.00 loops=1)
Hash Cond: (t1."C 1" = t3.c1)
Buffers: shared hit=12232, temp read=12754 written=12754
-> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4)
(actual time=0.039..48.808 rows=990821.00 loops=1)
Buffers: shared hit=12232
-> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual
time=112310.386..112310.387 rows=990821.00 loops=1)
Buckets: 262144 (originally 1024) Batches: 8 (originally 1)
Memory Usage: 6408kB
Buffers: temp written=2537
-> Nested Loop (cost=200.43..109001.00 rows=1000 width=4)
(actual time=0.728..112030.241 rows=990821.00 loops=1)
-> Foreign Scan on ft1 t2 (cost=100.00..331.00 rows=1000
width=4) (actual time=0.398..710.505 rows=990821.00 loops=1)
-> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1
width=4) (actual time=0.082..0.082 rows=1.00 loops=990821)
Planning:
Buffers: shared hit=5
Planning Time: 2.211 ms
Execution Time: 112825.428 ms
(15 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
explain (analyse, buffers) SELECT t1."C 1" FROM "S 1"."T 1" t1 left join
ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual
time=261.416..354.520 rows=990821.00 loops=1)
Hash Cond: (t1."C 1" = t3.c1)
Buffers: shared hit=12232, temp written=2660
-> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4)
(actual time=0.021..35.531 rows=990821.00 loops=1)
Buffers: shared hit=12232
-> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual
time=261.381..261.383 rows=100.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
Buffers: temp written=2660
-> Nested Loop (cost=200.43..109001.00 rows=1000 width=4) (actual
time=255.563..261.356 rows=100.00 loops=1)
Buffers: temp written=2660
-> Foreign Scan on ft1 t2 (cost=100.00..331.00 rows=1000 width=4)
(actual time=0.433..0.443 rows=100.00 loops=1)
-> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1 width=4)
(actual time=2.609..2.609 rows=1.00 loops=100)
Buffers: temp written=2660
Planning:
Buffers: shared hit=5
Planning Time: 2.284 ms
Execution Time: 389.358 ms
(17 rows)

So even in the case without a parallel plan, it is performing
significantly better. I investigated a bit more to find out why the query
was so slow with the cursors,
and came to understand that it is repeatedly abandoning and recreating the
cursor via the code path of postgresReScanForeignScan.
So, it looks like cursor management itself costs this much more time.

To understand the impact of the patch in case of tuples spilled to disk, I
tried following example,
contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# set enable_hashjoin = off;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b
where a.c1 = b.c1 order by a.c2;
QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=831.49..833.99 rows=1000 width=94) (actual
time=4537.437..4598.483 rows=990821.00 loops=1)
Sort Key: a.c2
Sort Method: external merge Disk: 137768kB
Buffers: temp read=83156 written=83253
-> Merge Join (cost=761.66..781.66 rows=1000 width=94) (actual
time=3748.488..4090.547 rows=990821.00 loops=1)
Merge Cond: (a.c1 = b.c1)
Buffers: temp read=48725 written=48779
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1818.521..1865.792 rows=990821.00 loops=1)
Sort Key: a.c1
Sort Method: external merge Disk: 75664kB
Buffers: temp read=18910 written=18937
-> Foreign Scan on ft1 a (cost=100.00..331.00 rows=1000
width=47) (actual time=1.302..1568.640 rows=990821.00 loops=1)
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1929.955..1981.104 rows=990821.00 loops=1)
Sort Key: b.c1
Sort Method: external sort Disk: 79520kB
Buffers: temp read=29815 written=29842
-> Foreign Scan on ft1 b (cost=100.00..331.00 rows=1000
width=47) (actual time=0.528..1553.249 rows=990821.00 loops=1)
Planning Time: 0.479 ms
Execution Time: 4661.872 ms
(19 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b
where a.c1 = b.c1 order by a.c2;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=831.49..833.99 rows=1000 width=94) (actual
time=3376.385..3435.406 rows=990821.00 loops=1)
Sort Key: a.c2
Sort Method: external merge Disk: 137768kB
Buffers: temp read=83156 written=83253
-> Merge Join (cost=761.66..781.66 rows=1000 width=94) (actual
time=2565.517..2916.814 rows=990821.00 loops=1)
Merge Cond: (a.c1 = b.c1)
Buffers: temp read=48725 written=48779
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1249.517..1300.132 rows=990821.00 loops=1)
Sort Key: a.c1
Sort Method: external merge Disk: 75664kB
Buffers: temp read=18910 written=18937
-> Foreign Scan on ft1 a (cost=100.00..331.00 rows=1000
width=47) (actual time=1.651..980.740 rows=990821.00 loops=1)
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1315.990..1369.576 rows=990821.00 loops=1)
Sort Key: b.c1
Sort Method: external sort Disk: 79520kB
Buffers: temp read=29815 written=29842
-> Foreign Scan on ft1 b (cost=100.00..331.00 rows=1000
width=47) (actual time=0.426..970.728 rows=990821.00 loops=1)
Planning Time: 0.491 ms
Execution Time: 3527.457 ms
(19 rows)

So it doesn't hurt in this case either. But that is because not the
tuplestore which is added in this patch is spilling to disk here.
I am working on finding a case when there are two or more active cursors
and then when storing the tuples of one cursor,
they are spilled onto the disk. That might give a picture of the worst
case scenario of this patch.

Also, thanks to Kenan, a fellow hacker who finds this work interesting and
offered to do some performance analysis for this patch,
maybe he can also post more results here.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

#13

Rafia Sabih

rafia.pghackers@gmail.com

8 months ago

In reply to: KENAN YILMAZ (#12)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

On Tue, 25 Nov 2025 at 15:24, KENAN YILMAZ <kenan.yilmaz@localus.com.tr>
wrote:

Hello Hackers,

I have executed use_cursor = true/false with quick tests.

The test table 't1' has 20,000,000 rows (~2.8 GB in size).

The test query is `EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;`

The result sums are;

** `postgres_fdw.use_cursor = true` --> 20090.782 ms
** `postgres_fdw.use_cursor = false` --> 24103.994 ms

If you wish to run more advanced and complex test queries, feel free to do
so.

All execution test scenarios are listed below;

--
-- restart with flush caches;
$ pg_ctl -D $PGDATA stop && sudo sync && echo 3 | sudo tee
/proc/sys/vm/drop_caches && rm $PGDATA/log/*.log && pg_ctl -D $PGDATA start
-- Data prep stage
DROP DATABASE IF EXISTS testdb;
CREATE DATABASE testdb;
\c testdb;
DROP TABLE IF EXISTS t1 CASCADE;
CREATE UNLOGGED TABLE t1 (a int, b int, c text, d timestamp);

-- Insert test datas
INSERT INTO t1 SELECT 10 + mod(i, 30), i, md5(i::text) ||
md5((i+1000000)::text) || md5((i+2000000)::text), '2025-01-01
10:00:00'::timestamp + (random() * 31536000) * INTERVAL '1 second' FROM
generate_series(1, 20000000) i;

-- Table maintenance
ALTER TABLE t1 SET LOGGED;
VACUUM (ANALYZE, FULL) t1;

-- FDW prep stage
DROP EXTENSION IF EXISTS postgres_fdw CASCADE;
CREATE EXTENSION postgres_fdw;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS
(host '127.0.0.1', port '5432', dbname 'testdb');
CREATE USER MAPPING FOR postgres SERVER foreign_server OPTIONS (user
'postgres');
CREATE FOREIGN TABLE t1fdw (a int, b int, c text, d timestamp) SERVER
foreign_server OPTIONS (table_name 't1');

-- restart with flush caches are applied before all stage executions.

--
-- * Local t1 table EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
QUERY PLAN

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gather (cost=0.00..467803.85 rows=1 width=105) (actual
time=23260.306..23261.591 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=363637
I/O Timings: shared read=64590.910
-> Parallel Seq Scan on t1 (cost=0.00..467803.85 rows=1 width=105)
(actual time=23242.279..23242.280 rows=0.00 loops=3)
Filter: (a > 1000)
Rows Removed by Filter: 6666667
Buffers: shared read=363637
I/O Timings: shared read=64590.910
Planning:
Buffers: shared hit=54 read=14 dirtied=1
I/O Timings: shared read=23.281
Planning Time: 38.734 ms
Execution Time: 23269.677 ms
(15 rows)

Time: 23347.716 ms (00:23.348)

--
-- * use_cursor = true (Default) EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
QUERY PLAN

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual
time=20074.746..20074.796 rows=0.00 loops=1)
Planning:
Buffers: shared hit=33 read=17 dirtied=1
I/O Timings: shared read=10.696
Planning Time: 43.852 ms
Execution Time: 20090.782 ms
(6 rows)

Time: 20169.081 ms (00:20.169)

--> From Log Files
[153045]: line=11 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153045]: line=12 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[153045]: line=13 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: execute
<unnamed>: DECLARE c1 CURSOR FOR
SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153045]: line=14 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
exec_execute_message, postgres.c:2245
[153045]: line=15 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement:
FETCH 100 FROM c1
[153045]: line=16 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
exec_simple_query, postgres.c:1078
[153044]: line=5 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOG: 00000: duration: 20074.799 ms plan:
Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40)
(actual time=20074.746..20074.796 rows=0.00 loops=1)
[153044]: line=6 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
[153045]: line=17 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement:
CLOSE c1
[153045]: line=18 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
exec_simple_query, postgres.c:1078
[153045]: line=19 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: duration:
20057.543 ms plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
Seq Scan on t1 (cost=0.00..613637.45 rows=1 width=105) (actual
time=20057.541..20057.541 rows=0.00 loops=1)
Filter: (a > 1000)
Rows Removed by Filter: 20000000
[153045]: line=20 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
explain_ExecutorEnd, auto_explain.c:437
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement:
COMMIT TRANSACTION

--
-- * use_cursor = false EXPLAIN results
--

testdb=# SET postgres_fdw.use_cursor = false;

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
QUERY PLAN

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual
time=24080.945..24080.956 rows=0.00 loops=1)
Planning:
Buffers: shared hit=33 read=17
I/O Timings: shared read=30.074
Planning Time: 53.678 ms
Execution Time: 24103.994 ms
(6 rows)

Time: 24230.548 ms (00:24.231)

--> From Log Files
[153121]: line=11 sid=6923fc16.25621 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153121]: line=12 sid=6923fc16.25621 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[153121]: line=13 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG: 00000: execute <unnamed>: SELECT
a, b, c FROM public.t1 WHERE ((a > 1000))
[153121]: line=14 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION: exec_execute_message,
postgres.c:2245
[153113]: line=7 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOG: 00000: duration: 24080.958 ms plan:
Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40)
(actual time=24080.945..24080.956 rows=0.00 loops=1)
[153113]: line=8 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
[153121]: line=15 sid=6923fc16.25621 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement:
COMMIT TRANSACTION
[153121]: line=16 sid=6923fc16.25621 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
exec_simple_query, postgres.c:1078
[153121]: line=17 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG: 00000: duration: 24059.372 ms
plan:
Query Text: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
Gather (cost=0.00..467803.85 rows=1 width=105) (actual
time=24058.076..24059.367 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on t1 (cost=0.00..467803.85 rows=1
width=105) (actual time=24051.406..24051.407 rows=0.00 loops=3)
Filter: (a > 1000)
Rows Removed by Filter: 6666667
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd,
auto_explain.c:437

---

Kenan YILMAZ

Thanks Kenan for these. So, it looks like the patch performs the same as in
the local scan case. I wonder if you found any case of performance
degradation with the patch.

Per an off-list discussion with Robert, he suggested using the
existing data structures for recording the state of last queries instead of
inventing something new.
Makes sense, so I reworked the patch to include tuplestore in
PgFdwScanState and then use PgFdwScanState as part of PgFdwConnState to
keep track of previously
active cursors. Nothing else is changed in this version of the patch.

Rafia Sabih <rafia.pghackers@gmail.com>, 25 Kas 2025 Sal, 17:01 tarihinde
şunu yazdı:

On Fri, 10 Oct 2025 at 22:02, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia.pghackers@gmail.com>

wrote:

I am back at this work with a rebased and revised patch. The new

version is rebased and has a change in approach.

Whenever we are using non-cursor mode, for the first cursor we are

always saving the tuples

in the tuplestore, this is because we do not have any means to know

beforehand how many cursors are required for the query.

This might have the advantage of being simpler, but it's definitely
worse. If we're only fetching one result set, which will be common,
we'll buffer the whole thing in a tuplestore where that could
otherwise be avoided. Maybe it's still best to go with this approach;
not sure.

And when we switch to the next query then we do not have a way to

fetch the tuples for the previous query.

So, the tuples retrieved earlier for the first query were lost if not

saved.

I would highly appreciate your time and feedback for this.

My suggestions are to work on the following areas:

1. Automated testing. The patch has no regression tests, and won't get
committed without those.

2. Manual testing. How does the performance with this new option
compare to the existing method? The answer might be different for
small result sets that fit in memory and large ones that spill to
disk, and will certainly also depend on how productive parallel query
can be on the remote side; but I think you want to do and post on this
list some measurements showing the best and worst case for the patch.

3. Patch clean-up. There are plenty of typos and whitespace-only
changes in the patch. It's best to clean those up. Running pgindent is
a good idea too. Some places could use comments.

--
Robert Haas
EDB: http://www.enterprisedb.com

Thank you Robert, for reviewing this patch. On going through the patch
more, I realised this was not equipped to handle the cases when there are
more than two active cursors. So to accommodate such a case, I now modified
the new struct for saving the previous query to a list of such structs.
Also, it turns out we need not to save the tuples in case this is an active
cursor, so we only populate the associated tuplestore only when we need to
create a new cursor when the old cursor is not completely done.

In this version I have added the regression tests as well. I wanted to
test this patch for all the cases of postgres_fdw, the only way I could
figure out how to do this was to test the select statements with the new
GUC.

I also did some tests for performance. I used the contrib_regression
database and populated the table "S1"."T1" with more tuples to understand
the impact of patch on higher scale. I also used auto_explain to get the
foreign plans.

contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:35.825 CET [44338] LOG: duration: 61.336 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
2025-11-14 14:40:35.825 CET [44862] LOG: duration: 60.575 ms plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT count(*) FROM "S 1"."T 1"
Aggregate (cost=21888.22..21888.23 rows=1 width=8)
-> Seq Scan on "T 1" (cost=0.00..19956.98 rows=772498 width=0)
count
--------
990821
(1 row)

Time: 62.728 ms
contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
Time: 1.515 ms
contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:46.260 CET [44862] LOG: duration: 21.875 ms plan:
Query Text: SELECT count(*) FROM "S 1"."T 1"
Finalize Aggregate (cost=17255.64..17255.65 rows=1 width=8)
-> Gather (cost=17255.43..17255.64 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=16255.43..16255.44 rows=1 width=8)
-> Parallel Seq Scan on "T 1" (cost=0.00..15450.74
rows=321874 width=0)
2025-11-14 14:40:46.260 CET [44338] LOG: duration: 22.623 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
count
--------
990821
(1 row)

Time: 24.862 ms

So for this query, the advantage is coming from parallel query which was
otherwise not possible in this scenario.
To study more performance with this patch, I found another interesting
query and here are the results,

contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# explain (analyse, buffers) SELECT t1."C 1" FROM "S
1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 =
t1."C 1");2025-11-14 15:57:46.893 CET [1946]
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual
time=112311.578..112804.516 rows=990821.00 loops=1)
Hash Cond: (t1."C 1" = t3.c1)
Buffers: shared hit=12232, temp read=12754 written=12754
-> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4)
(actual time=0.039..48.808 rows=990821.00 loops=1)
Buffers: shared hit=12232
-> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual
time=112310.386..112310.387 rows=990821.00 loops=1)
Buckets: 262144 (originally 1024) Batches: 8 (originally 1)
Memory Usage: 6408kB
Buffers: temp written=2537
-> Nested Loop (cost=200.43..109001.00 rows=1000 width=4)
(actual time=0.728..112030.241 rows=990821.00 loops=1)
-> Foreign Scan on ft1 t2 (cost=100.00..331.00 rows=1000
width=4) (actual time=0.398..710.505 rows=990821.00 loops=1)
-> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1
width=4) (actual time=0.082..0.082 rows=1.00 loops=990821)
Planning:
Buffers: shared hit=5
Planning Time: 2.211 ms
Execution Time: 112825.428 ms
(15 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
explain (analyse, buffers) SELECT t1."C 1" FROM "S 1"."T 1" t1 left
join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual
time=261.416..354.520 rows=990821.00 loops=1)
Hash Cond: (t1."C 1" = t3.c1)
Buffers: shared hit=12232, temp written=2660
-> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4)
(actual time=0.021..35.531 rows=990821.00 loops=1)
Buffers: shared hit=12232
-> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual
time=261.381..261.383 rows=100.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
Buffers: temp written=2660
-> Nested Loop (cost=200.43..109001.00 rows=1000 width=4) (actual
time=255.563..261.356 rows=100.00 loops=1)
Buffers: temp written=2660
-> Foreign Scan on ft1 t2 (cost=100.00..331.00 rows=1000 width=4)
(actual time=0.433..0.443 rows=100.00 loops=1)
-> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1 width=4)
(actual time=2.609..2.609 rows=1.00 loops=100)
Buffers: temp written=2660
Planning:
Buffers: shared hit=5
Planning Time: 2.284 ms
Execution Time: 389.358 ms
(17 rows)

So even in the case without a parallel plan, it is performing
significantly better. I investigated a bit more to find out why the query
was so slow with the cursors,
and came to understand that it is repeatedly abandoning and recreating
the cursor via the code path of postgresReScanForeignScan.
So, it looks like cursor management itself costs this much more time.

To understand the impact of the patch in case of tuples spilled to disk,
I tried following example,
contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# set enable_hashjoin = off;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b
where a.c1 = b.c1 order by a.c2;
QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=831.49..833.99 rows=1000 width=94) (actual
time=4537.437..4598.483 rows=990821.00 loops=1)
Sort Key: a.c2
Sort Method: external merge Disk: 137768kB
Buffers: temp read=83156 written=83253
-> Merge Join (cost=761.66..781.66 rows=1000 width=94) (actual
time=3748.488..4090.547 rows=990821.00 loops=1)
Merge Cond: (a.c1 = b.c1)
Buffers: temp read=48725 written=48779
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1818.521..1865.792 rows=990821.00 loops=1)
Sort Key: a.c1
Sort Method: external merge Disk: 75664kB
Buffers: temp read=18910 written=18937
-> Foreign Scan on ft1 a (cost=100.00..331.00 rows=1000
width=47) (actual time=1.302..1568.640 rows=990821.00 loops=1)
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1929.955..1981.104 rows=990821.00 loops=1)
Sort Key: b.c1
Sort Method: external sort Disk: 79520kB
Buffers: temp read=29815 written=29842
-> Foreign Scan on ft1 b (cost=100.00..331.00 rows=1000
width=47) (actual time=0.528..1553.249 rows=990821.00 loops=1)
Planning Time: 0.479 ms
Execution Time: 4661.872 ms
(19 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b
where a.c1 = b.c1 order by a.c2;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=831.49..833.99 rows=1000 width=94) (actual
time=3376.385..3435.406 rows=990821.00 loops=1)
Sort Key: a.c2
Sort Method: external merge Disk: 137768kB
Buffers: temp read=83156 written=83253
-> Merge Join (cost=761.66..781.66 rows=1000 width=94) (actual
time=2565.517..2916.814 rows=990821.00 loops=1)
Merge Cond: (a.c1 = b.c1)
Buffers: temp read=48725 written=48779
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1249.517..1300.132 rows=990821.00 loops=1)
Sort Key: a.c1
Sort Method: external merge Disk: 75664kB
Buffers: temp read=18910 written=18937
-> Foreign Scan on ft1 a (cost=100.00..331.00 rows=1000
width=47) (actual time=1.651..980.740 rows=990821.00 loops=1)
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1315.990..1369.576 rows=990821.00 loops=1)
Sort Key: b.c1
Sort Method: external sort Disk: 79520kB
Buffers: temp read=29815 written=29842
-> Foreign Scan on ft1 b (cost=100.00..331.00 rows=1000
width=47) (actual time=0.426..970.728 rows=990821.00 loops=1)
Planning Time: 0.491 ms
Execution Time: 3527.457 ms
(19 rows)

So it doesn't hurt in this case either. But that is because not the
tuplestore which is added in this patch is spilling to disk here.
I am working on finding a case when there are two or more active cursors
and then when storing the tuples of one cursor,
they are spilled onto the disk. That might give a picture of the worst
case scenario of this patch.

Also, thanks to Kenan, a fellow hacker who finds this work interesting
and offered to do some performance analysis for this patch,
maybe he can also post more results here.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

#14

KENAN YILMAZ

kenan.yilmaz@localus.com.tr

8 months ago

In reply to: Rafia Sabih (#13)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

Hello Hackers,

I have executed another quick tests based on v4-0001-Fetch-without-cursors
patch.
I also recompiled PostgreSQL from the latest main branch.

Here are the execution time results;

** Local t1 table select (1st run) → 35049.790 ms — Parallel workers
launched
** postgres_fdw.use_cursor = true → 18996.236 ms — No parallel workers
launched
** postgres_fdw.use_cursor = false → 24962.529 ms — Parallel workers
launched
** Local t1 table select (2nd run) → 21406.631 ms — Parallel workers
launched

In my opinion, the primary goal is to enable FDW to fully support parallel
execution.
From that perspective, these results seem acceptable.

If you would like to run more advanced or complex test queries, please feel
free to proceed.

The following tests were executed in the same manner as in my previous test
runs;

--
-- * Local t1 table EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gather (cost=0.00..467803.85 rows=1 width=105) (actual
time=35042.394..35043.542 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=240 read=363397
I/O Timings: shared read=99749.585
-> Parallel Seq Scan on t1 (cost=0.00..467803.85 rows=1 width=105)
(actual time=35023.992..35023.993 rows=0.00 loops=3)
Filter: (a > 1000)
Rows Removed by Filter: 6666667
Buffers: shared hit=240 read=363397
I/O Timings: shared read=99749.585
Planning:
Buffers: shared hit=64 read=9 dirtied=1
I/O Timings: shared read=4.656
Planning Time: 30.145 ms
Execution Time: 35049.790 ms
(15 rows)

Time: 35154.644 ms (00:35.155)

--
-- * use_cursor = true EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual
time=18981.860..18981.896 rows=0.00 loops=1)
Planning:
Buffers: shared hit=33 read=17 dirtied=1
I/O Timings: shared read=27.012
Planning Time: 65.290 ms
Execution Time: 18996.236 ms
(6 rows)

Time: 19121.351 ms (00:19.121)

--> From Log Files
--> From Log Files
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: execute
<unnamed>: DECLARE c1 CURSOR FOR
SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
exec_execute_message, postgres.c:2245
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: FETCH
100 FROM c1
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[197882]: line=6 sid=6929af48.304fa tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=psql client=[local]: LOG: 00000: duration: 18981.899 ms plan:
Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40)
(actual time=18981.860..18981.896 rows=0.00 loops=1)
[197882]: line=6 sid=6929af48.304fa tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: CLOSE
c1
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: duration:
18950.616 ms plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
Seq Scan on t1 (cost=0.00..613637.45 rows=1 width=105) (actual
time=18950.614..18950.614 rows=0.00 loops=1)
Filter: (a > 1000)
Rows Removed by Filter: 20000000
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
explain_ExecutorEnd, auto_explain.c:437
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT
TRANSACTION
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query, postgres.c:1078
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078

--
-- * use_cursor = false EXPLAIN results
--

testdb=# SET postgres_fdw.use_cursor = false;

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual
time=24940.034..24940.054 rows=0.00 loops=1)
Planning:
Buffers: shared hit=33 read=17
I/O Timings: shared read=34.719
Planning Time: 62.300 ms
Execution Time: 24962.529 ms
(6 rows)

Time: 25124.566 ms (00:25.125)

--> From Log Files
[197919]: line=18 sid=6929afcb.3051f tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
[197919]: line=18 sid=6929afcb.3051f tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[197919]: line=18 sid=6929afcb.3051f tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOG: 00000: execute <unnamed>: SELECT
a, b, c FROM public.t1 WHERE ((a > 1000))
[197919]: line=18 sid=6929afcb.3051f tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOCATION: exec_execute_message,
postgres.c:2245
[197917]: line=8 sid=6929afbd.3051d tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=psql client=[local]: LOG: 00000: duration: 24940.057 ms plan:
Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40)
(actual time=24940.034..24940.054 rows=0.00 loops=1)
[197917]: line=8 sid=6929afbd.3051d tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
[197919]: line=18 sid=6929afcb.3051f tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: COMMIT
TRANSACTION
[197919]: line=18 sid=6929afcb.3051f tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[197919]: line=18 sid=6929afcb.3051f tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOG: 00000: duration: 24908.608 ms
plan:
Query Text: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
Gather (cost=0.00..467803.85 rows=1 width=105) (actual
time=24906.370..24908.603 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on t1 (cost=0.00..467803.85 rows=1
width=105) (actual time=24895.314..24895.314 rows=0.00 loops=3)
Filter: (a > 1000)
Rows Removed by Filter: 6666667
[197919]: line=18 sid=6929afcb.3051f tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd, auto_explain.c:437
app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd,
auto_explain.c:437

---
--
-- * Local t1 table EXPLAIN results (again)
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gather (cost=0.00..467803.67 rows=1 width=105) (actual
time=21403.492..21405.381 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=363637
I/O Timings: shared read=59425.134
-> Parallel Seq Scan on t1 (cost=0.00..467803.67 rows=1 width=105)
(actual time=21394.997..21394.997 rows=0.00 loops=3)
Filter: (a > 1000)
Rows Removed by Filter: 6666667
Buffers: shared read=363637
I/O Timings: shared read=59425.134
Planning:
Buffers: shared hit=54 read=14
I/O Timings: shared read=30.482
Planning Time: 37.955 ms
Execution Time: 21406.631 ms
(15 rows)

Time: 21506.079 ms (00:21.506)

---
Kenan YILMAZ

Rafia Sabih <rafia.pghackers@gmail.com>, 27 Kas 2025 Per, 13:50 tarihinde
şunu yazdı:

Show quoted text

On Tue, 25 Nov 2025 at 15:24, KENAN YILMAZ <kenan.yilmaz@localus.com.tr>
wrote:

Hello Hackers,

I have executed use_cursor = true/false with quick tests.

The test table 't1' has 20,000,000 rows (~2.8 GB in size).

The test query is `EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;`

The result sums are;

** `postgres_fdw.use_cursor = true` --> 20090.782 ms
** `postgres_fdw.use_cursor = false` --> 24103.994 ms

If you wish to run more advanced and complex test queries, feel free to
do so.

All execution test scenarios are listed below;

--
-- restart with flush caches;
$ pg_ctl -D $PGDATA stop && sudo sync && echo 3 | sudo tee
/proc/sys/vm/drop_caches && rm $PGDATA/log/*.log && pg_ctl -D $PGDATA start
-- Data prep stage
DROP DATABASE IF EXISTS testdb;
CREATE DATABASE testdb;
\c testdb;
DROP TABLE IF EXISTS t1 CASCADE;
CREATE UNLOGGED TABLE t1 (a int, b int, c text, d timestamp);

-- Insert test datas
INSERT INTO t1 SELECT 10 + mod(i, 30), i, md5(i::text) ||
md5((i+1000000)::text) || md5((i+2000000)::text), '2025-01-01
10:00:00'::timestamp + (random() * 31536000) * INTERVAL '1 second' FROM
generate_series(1, 20000000) i;

-- Table maintenance
ALTER TABLE t1 SET LOGGED;
VACUUM (ANALYZE, FULL) t1;

-- FDW prep stage
DROP EXTENSION IF EXISTS postgres_fdw CASCADE;
CREATE EXTENSION postgres_fdw;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS
(host '127.0.0.1', port '5432', dbname 'testdb');
CREATE USER MAPPING FOR postgres SERVER foreign_server OPTIONS (user
'postgres');
CREATE FOREIGN TABLE t1fdw (a int, b int, c text, d timestamp) SERVER
foreign_server OPTIONS (table_name 't1');

-- restart with flush caches are applied before all stage executions.

--
-- * Local t1 table EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
QUERY PLAN

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gather (cost=0.00..467803.85 rows=1 width=105) (actual
time=23260.306..23261.591 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=363637
I/O Timings: shared read=64590.910
-> Parallel Seq Scan on t1 (cost=0.00..467803.85 rows=1 width=105)
(actual time=23242.279..23242.280 rows=0.00 loops=3)
Filter: (a > 1000)
Rows Removed by Filter: 6666667
Buffers: shared read=363637
I/O Timings: shared read=64590.910
Planning:
Buffers: shared hit=54 read=14 dirtied=1
I/O Timings: shared read=23.281
Planning Time: 38.734 ms
Execution Time: 23269.677 ms
(15 rows)

Time: 23347.716 ms (00:23.348)

--
-- * use_cursor = true (Default) EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a >
1000;
QUERY PLAN

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual
time=20074.746..20074.796 rows=0.00 loops=1)
Planning:
Buffers: shared hit=33 read=17 dirtied=1
I/O Timings: shared read=10.696
Planning Time: 43.852 ms
Execution Time: 20090.782 ms
(6 rows)

Time: 20169.081 ms (00:20.169)

--> From Log Files
[153045]: line=11 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153045]: line=12 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[153045]: line=13 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: execute
<unnamed>: DECLARE c1 CURSOR FOR
SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153045]: line=14 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
exec_execute_message, postgres.c:2245
[153045]: line=15 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement:
FETCH 100 FROM c1
[153045]: line=16 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
exec_simple_query, postgres.c:1078
[153044]: line=5 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOG: 00000: duration: 20074.799 ms plan:
Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40)
(actual time=20074.746..20074.796 rows=0.00 loops=1)
[153044]: line=6 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
[153045]: line=17 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement:
CLOSE c1
[153045]: line=18 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
exec_simple_query, postgres.c:1078
[153045]: line=19 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: duration:
20057.543 ms plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
Seq Scan on t1 (cost=0.00..613637.45 rows=1 width=105) (actual
time=20057.541..20057.541 rows=0.00 loops=1)
Filter: (a > 1000)
Rows Removed by Filter: 20000000
[153045]: line=20 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
explain_ExecutorEnd, auto_explain.c:437
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement:
COMMIT TRANSACTION

--
-- * use_cursor = false EXPLAIN results
--

testdb=# SET postgres_fdw.use_cursor = false;

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a >
1000;
QUERY PLAN

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40) (actual
time=24080.945..24080.956 rows=0.00 loops=1)
Planning:
Buffers: shared hit=33 read=17
I/O Timings: shared read=30.074
Planning Time: 53.678 ms
Execution Time: 24103.994 ms
(6 rows)

Time: 24230.548 ms (00:24.231)

--> From Log Files
[153121]: line=11 sid=6923fc16.25621 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG: 00000: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153121]: line=12 sid=6923fc16.25621 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION: exec_simple_query,
postgres.c:1078
[153121]: line=13 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG: 00000: execute <unnamed>:
SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153121]: line=14 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION: exec_execute_message,
postgres.c:2245
[153113]: line=7 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOG: 00000: duration: 24080.958 ms plan:
Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;
Foreign Scan on t1fdw (cost=100.00..215.67 rows=427 width=40)
(actual time=24080.945..24080.956 rows=0.00 loops=1)
[153113]: line=8 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOCATION: explain_ExecutorEnd, auto_explain.c:437
[153121]: line=15 sid=6923fc16.25621 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG: 00000: statement:
COMMIT TRANSACTION
[153121]: line=16 sid=6923fc16.25621 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
exec_simple_query, postgres.c:1078
[153121]: line=17 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG: 00000: duration: 24059.372 ms
plan:
Query Text: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
Gather (cost=0.00..467803.85 rows=1 width=105) (actual
time=24058.076..24059.367 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on t1 (cost=0.00..467803.85 rows=1
width=105) (actual time=24051.406..24051.407 rows=0.00 loops=3)
Filter: (a > 1000)
Rows Removed by Filter: 6666667
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION: explain_ExecutorEnd,
auto_explain.c:437

---

Kenan YILMAZ

Thanks Kenan for these. So, it looks like the patch performs the same as
in the local scan case. I wonder if you found any case of performance
degradation with the patch.

Per an off-list discussion with Robert, he suggested using the
existing data structures for recording the state of last queries instead of
inventing something new.
Makes sense, so I reworked the patch to include tuplestore in
PgFdwScanState and then use PgFdwScanState as part of PgFdwConnState to
keep track of previously
active cursors. Nothing else is changed in this version of the patch.

Rafia Sabih <rafia.pghackers@gmail.com>, 25 Kas 2025 Sal, 17:01
tarihinde şunu yazdı:

On Fri, 10 Oct 2025 at 22:02, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <

rafia.pghackers@gmail.com> wrote:

I am back at this work with a rebased and revised patch. The new

version is rebased and has a change in approach.

Whenever we are using non-cursor mode, for the first cursor we are

always saving the tuples

in the tuplestore, this is because we do not have any means to know

beforehand how many cursors are required for the query.

This might have the advantage of being simpler, but it's definitely
worse. If we're only fetching one result set, which will be common,
we'll buffer the whole thing in a tuplestore where that could
otherwise be avoided. Maybe it's still best to go with this approach;
not sure.

And when we switch to the next query then we do not have a way to

fetch the tuples for the previous query.

So, the tuples retrieved earlier for the first query were lost if

not saved.

I would highly appreciate your time and feedback for this.

My suggestions are to work on the following areas:

1. Automated testing. The patch has no regression tests, and won't get
committed without those.

2. Manual testing. How does the performance with this new option
compare to the existing method? The answer might be different for
small result sets that fit in memory and large ones that spill to
disk, and will certainly also depend on how productive parallel query
can be on the remote side; but I think you want to do and post on this
list some measurements showing the best and worst case for the patch.

3. Patch clean-up. There are plenty of typos and whitespace-only
changes in the patch. It's best to clean those up. Running pgindent is
a good idea too. Some places could use comments.

--
Robert Haas
EDB: http://www.enterprisedb.com

Thank you Robert, for reviewing this patch. On going through the patch
more, I realised this was not equipped to handle the cases when there are
more than two active cursors. So to accommodate such a case, I now modified
the new struct for saving the previous query to a list of such structs.
Also, it turns out we need not to save the tuples in case this is an active
cursor, so we only populate the associated tuplestore only when we need to
create a new cursor when the old cursor is not completely done.

In this version I have added the regression tests as well. I wanted to
test this patch for all the cases of postgres_fdw, the only way I could
figure out how to do this was to test the select statements with the new
GUC.

I also did some tests for performance. I used the contrib_regression
database and populated the table "S1"."T1" with more tuples to understand
the impact of patch on higher scale. I also used auto_explain to get the
foreign plans.

contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:35.825 CET [44338] LOG: duration: 61.336 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
2025-11-14 14:40:35.825 CET [44862] LOG: duration: 60.575 ms plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT count(*) FROM "S 1"."T 1"
Aggregate (cost=21888.22..21888.23 rows=1 width=8)
-> Seq Scan on "T 1" (cost=0.00..19956.98 rows=772498 width=0)
count
--------
990821
(1 row)

Time: 62.728 ms
contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
Time: 1.515 ms
contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:46.260 CET [44862] LOG: duration: 21.875 ms plan:
Query Text: SELECT count(*) FROM "S 1"."T 1"
Finalize Aggregate (cost=17255.64..17255.65 rows=1 width=8)
-> Gather (cost=17255.43..17255.64 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=16255.43..16255.44 rows=1 width=8)
-> Parallel Seq Scan on "T 1" (cost=0.00..15450.74
rows=321874 width=0)
2025-11-14 14:40:46.260 CET [44338] LOG: duration: 22.623 ms plan:
Query Text: select count(*) from ft1;
Foreign Scan (cost=102.50..123.72 rows=1 width=8)
Relations: Aggregate on (ft1)
count
--------
990821
(1 row)

Time: 24.862 ms

So for this query, the advantage is coming from parallel query which was
otherwise not possible in this scenario.
To study more performance with this patch, I found another interesting
query and here are the results,

contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# explain (analyse, buffers) SELECT t1."C 1" FROM
"S 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 =
t1."C 1");2025-11-14 15:57:46.893 CET [1946]
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual
time=112311.578..112804.516 rows=990821.00 loops=1)
Hash Cond: (t1."C 1" = t3.c1)
Buffers: shared hit=12232, temp read=12754 written=12754
-> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4)
(actual time=0.039..48.808 rows=990821.00 loops=1)
Buffers: shared hit=12232
-> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual
time=112310.386..112310.387 rows=990821.00 loops=1)
Buckets: 262144 (originally 1024) Batches: 8 (originally 1)
Memory Usage: 6408kB
Buffers: temp written=2537
-> Nested Loop (cost=200.43..109001.00 rows=1000 width=4)
(actual time=0.728..112030.241 rows=990821.00 loops=1)
-> Foreign Scan on ft1 t2 (cost=100.00..331.00
rows=1000 width=4) (actual time=0.398..710.505 rows=990821.00 loops=1)
-> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1
width=4) (actual time=0.082..0.082 rows=1.00 loops=990821)
Planning:
Buffers: shared hit=5
Planning Time: 2.211 ms
Execution Time: 112825.428 ms
(15 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
explain (analyse, buffers) SELECT t1."C 1" FROM "S 1"."T 1" t1 left
join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual
time=261.416..354.520 rows=990821.00 loops=1)
Hash Cond: (t1."C 1" = t3.c1)
Buffers: shared hit=12232, temp written=2660
-> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4)
(actual time=0.021..35.531 rows=990821.00 loops=1)
Buffers: shared hit=12232
-> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual
time=261.381..261.383 rows=100.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
Buffers: temp written=2660
-> Nested Loop (cost=200.43..109001.00 rows=1000 width=4) (actual
time=255.563..261.356 rows=100.00 loops=1)
Buffers: temp written=2660
-> Foreign Scan on ft1 t2 (cost=100.00..331.00 rows=1000
width=4) (actual time=0.433..0.443 rows=100.00 loops=1)
-> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1 width=4)
(actual time=2.609..2.609 rows=1.00 loops=100)
Buffers: temp written=2660
Planning:
Buffers: shared hit=5
Planning Time: 2.284 ms
Execution Time: 389.358 ms
(17 rows)

So even in the case without a parallel plan, it is performing
significantly better. I investigated a bit more to find out why the query
was so slow with the cursors,
and came to understand that it is repeatedly abandoning and recreating
the cursor via the code path of postgresReScanForeignScan.
So, it looks like cursor management itself costs this much more time.

To understand the impact of the patch in case of tuples spilled to disk,
I tried following example,
contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# set enable_hashjoin = off;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1
b where a.c1 = b.c1 order by a.c2;
QUERY
PLAN

----------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=831.49..833.99 rows=1000 width=94) (actual
time=4537.437..4598.483 rows=990821.00 loops=1)
Sort Key: a.c2
Sort Method: external merge Disk: 137768kB
Buffers: temp read=83156 written=83253
-> Merge Join (cost=761.66..781.66 rows=1000 width=94) (actual
time=3748.488..4090.547 rows=990821.00 loops=1)
Merge Cond: (a.c1 = b.c1)
Buffers: temp read=48725 written=48779
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1818.521..1865.792 rows=990821.00 loops=1)
Sort Key: a.c1
Sort Method: external merge Disk: 75664kB
Buffers: temp read=18910 written=18937
-> Foreign Scan on ft1 a (cost=100.00..331.00 rows=1000
width=47) (actual time=1.302..1568.640 rows=990821.00 loops=1)
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1929.955..1981.104 rows=990821.00 loops=1)
Sort Key: b.c1
Sort Method: external sort Disk: 79520kB
Buffers: temp read=29815 written=29842
-> Foreign Scan on ft1 b (cost=100.00..331.00 rows=1000
width=47) (actual time=0.528..1553.249 rows=990821.00 loops=1)
Planning Time: 0.479 ms
Execution Time: 4661.872 ms
(19 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1
b where a.c1 = b.c1 order by a.c2;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=831.49..833.99 rows=1000 width=94) (actual
time=3376.385..3435.406 rows=990821.00 loops=1)
Sort Key: a.c2
Sort Method: external merge Disk: 137768kB
Buffers: temp read=83156 written=83253
-> Merge Join (cost=761.66..781.66 rows=1000 width=94) (actual
time=2565.517..2916.814 rows=990821.00 loops=1)
Merge Cond: (a.c1 = b.c1)
Buffers: temp read=48725 written=48779
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1249.517..1300.132 rows=990821.00 loops=1)
Sort Key: a.c1
Sort Method: external merge Disk: 75664kB
Buffers: temp read=18910 written=18937
-> Foreign Scan on ft1 a (cost=100.00..331.00 rows=1000
width=47) (actual time=1.651..980.740 rows=990821.00 loops=1)
-> Sort (cost=380.83..383.33 rows=1000 width=47) (actual
time=1315.990..1369.576 rows=990821.00 loops=1)
Sort Key: b.c1
Sort Method: external sort Disk: 79520kB
Buffers: temp read=29815 written=29842
-> Foreign Scan on ft1 b (cost=100.00..331.00 rows=1000
width=47) (actual time=0.426..970.728 rows=990821.00 loops=1)
Planning Time: 0.491 ms
Execution Time: 3527.457 ms
(19 rows)

So it doesn't hurt in this case either. But that is because not the
tuplestore which is added in this patch is spilling to disk here.
I am working on finding a case when there are two or more active cursors
and then when storing the tuples of one cursor,
they are spilled onto the disk. That might give a picture of the worst
case scenario of this patch.

Also, thanks to Kenan, a fellow hacker who finds this work interesting
and offered to do some performance analysis for this patch,
maybe he can also post more results here.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

#15

Robert Haas

robertmhaas@gmail.com

7 months ago

In reply to: Rafia Sabih (#13)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

On Thu, Nov 27, 2025 at 5:50 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:

Thanks Kenan for these. So, it looks like the patch performs the same as in the local scan case. I wonder if you found any case of performance degradation with the patch.

Per an off-list discussion with Robert, he suggested using the existing data structures for recording the state of last queries instead of inventing something new.
Makes sense, so I reworked the patch to include tuplestore in PgFdwScanState and then use PgFdwScanState as part of PgFdwConnState to keep track of previously
active cursors. Nothing else is changed in this version of the patch.

Thanks for your continued work on this topic.

As we discussed off-list, the regression tests in this patch need
quite a bit more work. We shouldn't just repeat large numbers of tests
-- we should repeat enough to provide meaningful test coverage of the
new feature, and maybe add a few new ones that specifically target
scenarios that the patch is intended to cover. One somewhat bizarre
thing that I noticed scrolling through the regression test outputs is
this:

+-- Test with non-cursor mode
+SET postgres_fdw.use_cursor = false;
+SET postgres_fdw.use_cursor = true;

This happens in multiple places in the regression tests, and doesn't
really make any sense to me, because changing the GUC from to false,
doing nothing, and then changing it back to true doesn't seem like a
useful test scenario, and definitely doesn't seem like a test scenario
that we need to repeat multiple times. Honestly, I wonder how this
happened. Did you maybe run a script over the .sql files to generate
the new versions and then not check what the output actually looked
like? I really can't stress enough the need to be thoughtful about
constructing test cases for a patch of this type.

A related problem is that you've included 12,721 lines of useless
output in the patch file, because somehow postgres_fdw.out.orig got
checked into the repository. It's always a good idea, when posting a
patch to the list, to check the diffstat to make sure that there's
nothing in there that you don't expect to see. The presence of this
line just below your commit message could have alerted you to this
problem:

.../expected/postgres_fdw.out.orig | 12721 ++++++++++++++++

Ideally, I really recommend scrolling through the patch, not just the
diffstat, to make sure that everything is the way you want it to be,
before giving it to others to look at.

There are a number of other cosmetic problems with this patch that
make it hard to review the actual code changes. For instance:

+ /* To be used only in non-cursor mode */
+ Tuplestorestate *tuplestore;
+ TupleTableSlot *slot;
+ bool tuples_ready;
+ struct PgFdwScanState *next;

The comment is good, but it only explains a general fact about all
four of these fields. It doesn't explain specifically what each of
these fields is intended to do. Naming a field very generally, like
"next", when it must have some very specific purpose that is not
documented, makes it really hard for someone reading the code -- in
this case, me -- to understand what the point is. So, these fields
should probably have comments, but also, some of them probably need to
be renamed. Maybe "tuplestore" and "slot" are OK but "next" is not
going to be good enough.

+/* Fill the Tupleslot when another query needs to execute. */
+static void
+fillTupleSlot(PgFdwScanState *fsstate, ForeignScanState *node)

I think you're filling a tuple store, not a tuple slot.

+ initStringInfo(&buf);

What seems to happen here is that you create an empty buffer, add
fsstate->query to it and nothing else, and then use buf.data. So why
not just get rid of buf and use fsstate->query directly?

+ /* Always fetch the last prev_query */
+ for (; p1->next != NULL; p1 = p1->next);

There are multiple problems here. First, this code is not per
PostgreSQL style and would be reindented by pgindent, which you should
make a habit of running before posting. Second, p1 is not a
particularly informative or understandable variable name. Third, why
are we using a linked list if the value we're going to need is at the
end of the list? If we need to be able to access the last element of
the list efficiently, maybe we should be keeping the list in reverse
order, or maybe we should be using a List which permits efficient
random access.

But looking quickly through the patch, I have an even bigger question
here. It doesn't really seem like we ever care about any element of
the list other than the last. It looks like we go to a fair amount of
trouble to maintain the list at various points in the code, but it
seems like when we access the list, we just always go to the end. That
makes me wonder why we even need a list. Perhaps instead of
PgFdwConnState storing a pointer to "prev_queries", it should just
store a pointer to the PgFdwScanState for the query that is currently
"using" the connection, if there is one. That is, the query whose
result set is currently being received, and which must be buffered
into a tuplestore before using the connection for something else. When
we've done that, we reset the field to NULL, and we don't maintain any
list of the older "previous queries". Maybe there's some problem with
that idea, but that gets back to my earlier point about comments: the
comment for "next" (or however it got renamed) ought to be explaining
something about why it exists and what it's for. Without such a
comment, the only ways for me to be sure whether we really need "next"
is to either (a) ask you or (b) spend a long time staring at the code
to try to figure it out or (c) try removing it and see if it works.

Ah, wait! I just found some code that cares about the list but not
just about the last element:

+ for (; p1->next != NULL; p1 = p1->next)
+ /* find the correct prev_query */
+ {
+ if ((p1->tuples_ready && fsstate->cursor_number == p1->cursor_number))
+ cur_prev = p1;
+ }

But I'm still confused. Why do we need to iterate through the
prev_queries list to find the correct ForeignScanState? Isn't that
exactly what fsstate is here? I'm inclined to think that this is just
a complicated way of setting p1 = fsstate and then setting cur_prev =
p1, so we could skip all this and just test whether
fsstate->tuplestore != NULL and if so retrieve tuples from there. If
there's some reason why fsstate and cur_prev would end up being
different, then there should be some comment explaining it. I'm
inclined to think that would be a sign of a design flaw: I think the
idea here should be to make use of the ForeignScanState object that
already exists to hold the details that we need for this feature,
rather than creating a new one. But, if there were a comment telling
me why it's like this, I might change my mind.

+ /* Clear the last query details, once tuples are retrieved. */
+ if (fsstate->conn_state->prev_queries == cur_prev)
+ {
+ /*
+ * This is the first prev query in the list, check if there
+ * are more
+ */
+ if (fsstate->conn_state->num_prev_queries > 1)
+
+ /*
+ * fsstate->conn_state->prev_queries->next =
+ * cur_prev->next;
+ */
+ fsstate->next = cur_prev->next;
+ clear_prev_query_state(cur_prev);
+ }

This code can remove an item from the prev_queries list, but only if
it's the first item. Now, if my analysis above is correct and we don't
even need the prev_queries list, then it doesn't matter; all this
logic can basically go away. If we do need the prev_queries list, then
I don't think it can be right to only be able to remove the first
element of the list. There's no reason why the oldest prev_query has
to finish first. What if there are three queries in the list and the
middle one finishes first? Then this will just leave it in the list,
whereas if the first one had finished first, it would have been
deleted from the list. That kind of inconsistency doesn't seem like a
good idea.

+ /*
+ * This is to fetch all the tuples of this query and save
+ * them in Tuple Slot. Since it is using
+ * PQSetChunkedRowsMode, we only get the
+ * fsstate->fetch_size tuples in one run, so keep on
+ * executing till we get NULL in PGresult i.e. all the
+ * tuples are retrieved.
+ */
+ p1->tuplestore = tuplestore_begin_heap(false, true, work_mem);
+ p1->slot = MakeSingleTupleTableSlot(fsstate->tupdesc, &TTSOpsMinimalTuple);

The slot doesn't seem to be used anywhere in the code that follows. Is
there a reason to initialize it here, rather than wherever it's
needed? If so, maybe the comment could mention that.

+ i = 0;

I don't think this does anything, because you reinitialize i in the
inner loop where you use it. In general, try moving your variable
declarations to the innermost scope where they are needed. It makes
the code more clear and allows the compiler to tell you about stuff
like this.

+ else if (PQresultStatus(res) == PGRES_TUPLES_OK)
+ {
+ while (res != NULL)
+ res = pgfdw_get_result(conn);
+ break;
+ }

I doubt that that this is right. It seems like it's willing to throw
away an infinite number of result sets without doing anything with
them, or discard an infinite number of errors without processing them,
but that doesn't seem likely to be the right thing.

+ else if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ {
+ clear_conn_state(fsstate->conn_state);
+ pgfdw_report_error(res, conn, fsstate->query);
+ }

I also doubt that this is right. If it's necessary to call
clear_conn_state() before reporting this error, then the error
handling in this patch is probably wrong. We don't know where errors
will be thrown in general and cannot rely on being able to do manual
error cleanup before an error occurs. The goal should for it to be
safe for pgfdw_report_error -- or just ereport(ERROR, ...) -- to
happen anywhere without causing problems in the future.

clear_conn_state() doesn't look right either. It does some cleanup on
each element of the prev_queries list and decrements num_prev_queries,
but it doesn't actually remove the queries from the list. So after
calling this function the first time, I think that num_prev_queries
might end up being 0 while prev_queries is still a list of the same
length as before. After that, I imagine calling clear_conn_state() a
second time would result in num_prev_queries going negative. I don't
really understand why num_prev_queries exists -- it seems like it's
just a recipe for mistakes to have both the list itself, and a
separate field that gives you the list length. If you need that, the
existing List data type will do that bookkeeping for you, and then
there's less opportunity for mistakes.

Overall, I think the direction of the patch set has some promise, but
I think it needs a lot of cleanup: removal of unnecessary code, proper
formatting, moving variables to inner scopes, explanatory comments,
good names for variables and functions and structure members, removal
of unnecessary files from the patch, cleanup of the regression test
coverage so that it doesn't add more bloat than necessary, proper
choice of data structures, and so on. Right now, the good things that
you've done here are being hidden by these sorts of mechanical issues.
That's not just an issue for me as a reviewer: I suspect it's also
blocking you, as the patch author, from finding places where the code
could be made better. Being able to find such opportunities for
improvement and act on them is what will get this patch from
"interesting proof of concept" to "potentially committable patch".

--
Robert Haas
EDB: http://www.enterprisedb.com

#16

Alexander Pyhalov

a.pyhalov@postgrespro.ru

7 months ago

In reply to: Rafia Sabih (#13)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

Rafia Sabih писал(а) 2025-11-27 13:50:

On Tue, 25 Nov 2025 at 15:24, KENAN YILMAZ
<kenan.yilmaz@localus.com.tr> wrote:

Hello Hackers,

I have executed use_cursor = true/false with quick tests.

Hi.

I've looked at this patch, because also was playing with tuplestore in
PgFdwScanState to have more predictable memory usage with large
fetch_size. Unfortunately, I've found that you can't use
TTSOpsMinimalTuple(?) tuplestore to store tuple's ctid. I hoped this
patch provides some solution for this issue, but no, it seems have the
same problem. Attaching test case with a reproducer.
--
Best regards,
Alexander Pyhalov,
Postgres Professional

#17

Rafia Sabih

rafia.pghackers@gmail.com

6 months ago

In reply to: Robert Haas (#15)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

On Wed, 10 Dec 2025 at 18:44, Robert Haas <robertmhaas@gmail.com> wrote:

Overall, I think the direction of the patch set has some promise, but
I think it needs a lot of cleanup: removal of unnecessary code, proper
formatting, moving variables to inner scopes, explanatory comments,
good names for variables and functions and structure members, removal
of unnecessary files from the patch, cleanup of the regression test
coverage so that it doesn't add more bloat than necessary, proper
choice of data structures, and so on. Right now, the good things that
you've done here are being hidden by these sorts of mechanical issues.
That's not just an issue for me as a reviewer: I suspect it's also
blocking you, as the patch author, from finding places where the code
could be made better. Being able to find such opportunities for
improvement and act on them is what will get this patch from
"interesting proof of concept" to "potentially committable patch".

--
Robert Haas
EDB: http://www.enterprisedb.com

Thanks Robert for your time and attention to this patch.
Based on these review comments and an off list discussion about the design
of the patch, I have reworked the patch significantly.
In this version, a tuplestore is added to the PgFdwScanState along with
a flag. Now, in case of a cursor switch, this tuplestore is filled with the
remaining tuples of the query. The corresponding flag is set to indicate
that the tuplestore is ready to be fetched. To remember the last query that
was executing, a pointer to its PgFdwScanState is maintained in the
conn_state. Note that we only need to remember just the very last query and
once tuples for it are fetched we can forget that pointer, because now its
fsstate has the remaining tuples in the tuplestore.
So, this version of the patch looks simpler than before.
A few test cases are also added in the attached patch to cover the
different scenarios in case of non-cursor mode.
--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

#18

Rafia Sabih

rafia.pghackers@gmail.com

4 months ago

In reply to: Rafia Sabih (#17)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

Came to know that there were some CI failures. Posting the new version of
the patch which fixes them.

On Mon, 26 Jan 2026 at 03:44, Rafia Sabih <rafia.pghackers@gmail.com> wrote:

On Wed, 10 Dec 2025 at 18:44, Robert Haas <robertmhaas@gmail.com> wrote:

Overall, I think the direction of the patch set has some promise, but
I think it needs a lot of cleanup: removal of unnecessary code, proper
formatting, moving variables to inner scopes, explanatory comments,
good names for variables and functions and structure members, removal
of unnecessary files from the patch, cleanup of the regression test
coverage so that it doesn't add more bloat than necessary, proper
choice of data structures, and so on. Right now, the good things that
you've done here are being hidden by these sorts of mechanical issues.
That's not just an issue for me as a reviewer: I suspect it's also
blocking you, as the patch author, from finding places where the code
could be made better. Being able to find such opportunities for
improvement and act on them is what will get this patch from
"interesting proof of concept" to "potentially committable patch".

--
Robert Haas
EDB: http://www.enterprisedb.com

Thanks Robert for your time and attention to this patch.
Based on these review comments and an off list discussion about the design
of the patch, I have reworked the patch significantly.
In this version, a tuplestore is added to the PgFdwScanState along with
a flag. Now, in case of a cursor switch, this tuplestore is filled with the
remaining tuples of the query. The corresponding flag is set to indicate
that the tuplestore is ready to be fetched. To remember the last query that
was executing, a pointer to its PgFdwScanState is maintained in the
conn_state. Note that we only need to remember just the very last query and
once tuples for it are fetched we can forget that pointer, because now its
fsstate has the remaining tuples in the tuplestore.
So, this version of the patch looks simpler than before.
A few test cases are also added in the attached patch to cover the
different scenarios in case of non-cursor mode.
--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

#19

Robert Haas

robertmhaas@gmail.com

4 months ago

In reply to: Rafia Sabih (#17)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

On Mon, Jan 26, 2026 at 6:44 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:

Thanks Robert for your time and attention to this patch.
Based on these review comments and an off list discussion about the design of the patch, I have reworked the patch significantly.
In this version, a tuplestore is added to the PgFdwScanState along with a flag. Now, in case of a cursor switch, this tuplestore is filled with the remaining tuples of the query. The corresponding flag is set to indicate that the tuplestore is ready to be fetched. To remember the last query that was executing, a pointer to its PgFdwScanState is maintained in the conn_state. Note that we only need to remember just the very last query and once tuples for it are fetched we can forget that pointer, because now its fsstate has the remaining tuples in the tuplestore.
So, this version of the patch looks simpler than before.
A few test cases are also added in the attached patch to cover the different scenarios in case of non-cursor mode.

+ DefineCustomBoolVariable("postgres_fdw.use_cursor",

Usually I would expect the GUC for a few feature to be defined in such
a way that a true values means to use the new feature and a false
value means not to do that. This is reversed. Maybe consider renaming
the GUC so that the sense is inverted.

+       /* To be used only in non-cursor mode */
+       Tuplestorestate *tuplestore;    /* Tuplestore to save the tuples of the
+
  * query for later fetc
h. */
+       TupleTableSlot *slot;           /* Slot to be used when
reading the tuple from
+                                                                * the
tuplestore */
+       bool            tuples_ready;   /* To indicate when tuplestore
is ready to be
+                                                                * read. */
+       int                     total_tuples;   /* total tuples in the
tuplestore. */

There's already a tuplestore_tuple_count() function, so it's not clear
why you need total_tuples. It's also not exactly clear what
tuples_ready means here: sure, it means that we're ready to read
tuples, but when does that happen? I think the /* To be used only in
non-cursor mode */ comment could use expansion, too. Maybe use this to
explain the idea that we're going to use a Tuplestore in non-cursor
mode when, and only when, there are overlapping scans.

+ fsstate->conn_state->last_query = NULL;

I think this should be called something like active_scan. last_query
makes it sound like the last query we ran, whether or not it is still
in progress. But the value that it points to is an PgFdwScanState, so
it's not a query, it's a scan, and we should only have this set to a
non-NULL value for as long as that scan is actually active (i.e. has
the exclusive use of the connection).

+ * This routine fetches all the tuples of a query and saves them in a
tuplestore.
+ * This is required when the result of a query is not completely
fetched but the control
+ * switches to a different query.
+ * A call to fetch_data is made from here, hence we need complete
ForeignScanState here.

This is a good explanation but the last sentence probably isn't
needed. Also, you don't mention that this only used in non-cursor mode
(or whatever we end up calling it).

+       if (conn->asyncStatus == PGASYNC_IDLE)
+       {
+               /* If the connection is not active then set up */
+               if (!PQsendQueryParams(conn, last_fsstate->query,
last_fsstate->numParams,
+                                                          NULL,
values, NULL, NULL, 0))
+                       pgfdw_report_error(NULL, conn, last_fsstate->query);
+
+               if (!PQsetChunkedRowsMode(conn, last_fsstate->fetch_size))
+                       pgfdw_report_error(NULL, conn, last_fsstate->query);
+       }

I think this is a problem for multiple reasons. One problem is that
the function header says that we use this "when the result of a query
is not completely fetched". But if the status where PGASYNC_IDLE, the
query has not even started yet. In that case, seems there is no reason
to call this function in the first place. We can just start the new
scan right away and wait to do anything about the other scan until
it's needed. Additionally, asyncStatus is not part of libpq's exposed
API. You're only able to access it here because you've included
libpq-int.h into postgres_fdw.h, but libpq-int.h is so-called because
it's "internal", so you should really be trying very hard not to use
it, and especially not to include it in another header file. I suggest
that there's probably just a bug here -- I don't think you should need
to check asyncStatus here in the first place.

+ if (pgfdw_use_cursor)
+ snprintf(sql, sizeof(sql), "CLOSE c%u",
+ fsstate->cursor_number);
+ else
+ close_cursor = true;

This is extremely confusing coding. If we're using a cursor, we write
some SQL into a buffer but don't actually do anything with the buffer
right away. But if we're not using a cursor, then we set a flag that
tells us to close a cursor ... which we're not using? I think the
logic updates in the function are not at all clear. You need to back
up and think more carefully about what the function is supposed to do
in each case. Look at how the function starts after your patch:

postgresReScanForeignScan(ForeignScanState *node)
{
PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
char sql[64];
PGresult *res;
bool close_cursor = false;

/* If we haven't created the cursor yet, nothing to do. */
if (!fsstate->cursor_exists)
return;

Well, if we're in non-cursor mode, that presumably means this function
always returns without doing anything, since there should not be a
cursor if we are not using with them. That can't be right.
Alternatively, maybe it means that the cursor_exists flag is now being
used to track something other than whether a cursor exists, which
wouldn't be right either. I don't know off the top of my head exactly
what this function should do in each case, but as the patch author,
it's your job to go through, figure that out, and make sure that it is
all correct and well-commented.

Similarly, you have a bunch of changes to a function called
create_cursor(), but why would a function that creates a cursor get
changed to handle a mode where we don't create cursors? Either the
function needs to be renamed, or there should be two separate
functions, or these changes don't do anything in the first place.

+ while (res != NULL)
+ res = pgfdw_get_result(fsstate->conn);

I don't see how this code can be right. We should know how many result
sets we expect and read exactly that number, not just read over and
over until there's nothing more. But even if we did want to read until
there's nothing more, we would need to free those result sets rather
than leaking them.

-
- /* Clean up */
- pfree(buf.data);

It's hard to understand why this would be a good idea.

/*
* We'll store the tuples in the batch_cxt. First, flush the previous
* batch.
*/
fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);

It's also hard to understand why this would be a good idea. It seems
like it makes the comment false: we wouldn't be flushing the previous
batch any more. We'd just be forgetting that it existed and leaking
the memory.

+ /*
+ * In non-cursor mode, there are three options for the further
+ * processing: 1. If the tuplestore is already filled, retrieve the
+ * tuples from there. 2. Fetch the tuples till the end of the query
+ * and store them in tuplestore. 3. Perform a normal fetch and process
+ * the tuples.

I find this comment pretty confusing. First of all, it's not really
clear what performing a "normal fetch" means here. But also, this says
that in non-cursor mode there are three options, and that's followed
by an if-else block i.e. two options. Three isn't the same as two, so
something isn't right here.

The code that follows is quite involved. It's rather deeply nested in
places, so possibly some of it needs to be broken out into separate
functions. It contains another instance, of this, which has the same
problems I pointed out earlier:

+ /* There are no more tuples to fetch */
+ while (res != NULL)
+ res = pgfdw_get_result(conn);

If there are no more tuples to fetch, then why do we need to loop
fetching result sets until we get a NULL? And if we ever get any, we
leak them, which is also bad.

+ /*
+ * Reset cursor mode when in asynchronous mode as it is not supported in
+ * non-cursor mode
+ */
+ if (!pgfdw_use_cursor)
+ pgfdw_use_cursor = true;

It's not allowed to change a GUC variable from outside of the GUC
code. If you need to disable the optimization in certain cases, you
need a separate flag for that, which is initially set based on
pgfdw_use_cursor and then can be changed here or wherever is needed.
You can't change the GUC value itself. This would cause a variety of
problems, including the fact that the optimization would then stay
disabled for later queries in the session, but probably also some more
GUC-related stuff.

-
SELECT current_database() AS current_database,
current_setting('port') AS current_port
\gset
-

Please avoid including unnecessary cosmetic changes in the patch file.

Overall, I think the test case changes are in much better shape than
they were in earlier patch versions. I suspect there might still be a
bit of room for improvement, but we can deal with that once some of
these other issues are sorted out.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20

Rafia Sabih

rafia.pghackers@gmail.com

3 months ago

In reply to: Robert Haas (#19)

Re: Bypassing cursors in postgres_fdw to enable parallel plans

On Fri, 6 Mar 2026 at 21:51, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 26, 2026 at 6:44 AM Rafia Sabih <rafia.pghackers@gmail.com>
wrote:

Thanks Robert for your time and attention to this patch.
Based on these review comments and an off list discussion about the

design of the patch, I have reworked the patch significantly.

In this version, a tuplestore is added to the PgFdwScanState along with

a flag. Now, in case of a cursor switch, this tuplestore is filled with the
remaining tuples of the query. The corresponding flag is set to indicate
that the tuplestore is ready to be fetched. To remember the last query that
was executing, a pointer to its PgFdwScanState is maintained in the
conn_state. Note that we only need to remember just the very last query and
once tuples for it are fetched we can forget that pointer, because now its
fsstate has the remaining tuples in the tuplestore.

So, this version of the patch looks simpler than before.
A few test cases are also added in the attached patch to cover the

different scenarios in case of non-cursor mode.

+ DefineCustomBoolVariable("postgres_fdw.use_cursor",

Usually I would expect the GUC for a few feature to be defined in such
a way that a true values means to use the new feature and a false
value means not to do that. This is reversed. Maybe consider renaming
the GUC so that the sense is inverted.
+       /* To be used only in non-cursor mode */
+       Tuplestorestate *tuplestore;    /* Tuplestore to save the tuples
of the
+
* query for later fetc
h. */
+       TupleTableSlot *slot;           /* Slot to be used when
reading the tuple from
+                                                                * the
tuplestore */
+       bool            tuples_ready;   /* To indicate when tuplestore
is ready to be
+                                                                * read. */
+       int                     total_tuples;   /* total tuples in the
tuplestore. */
There's already a tuplestore_tuple_count() function, so it's not clear
why you need total_tuples. It's also not exactly clear what
tuples_ready means here: sure, it means that we're ready to read
tuples, but when does that happen? I think the /* To be used only in
non-cursor mode */ comment could use expansion, too. Maybe use this to
explain the idea that we're going to use a Tuplestore in non-cursor
mode when, and only when, there are overlapping scans.

+ fsstate->conn_state->last_query = NULL;

I think this should be called something like active_scan. last_query
makes it sound like the last query we ran, whether or not it is still
in progress. But the value that it points to is an PgFdwScanState, so
it's not a query, it's a scan, and we should only have this set to a
non-NULL value for as long as that scan is actually active (i.e. has
the exclusive use of the connection).
+ * This routine fetches all the tuples of a query and saves them in a
tuplestore.
+ * This is required when the result of a query is not completely
fetched but the control
+ * switches to a different query.
+ * A call to fetch_data is made from here, hence we need complete
ForeignScanState here.
This is a good explanation but the last sentence probably isn't
needed. Also, you don't mention that this only used in non-cursor mode
(or whatever we end up calling it).
+       if (conn->asyncStatus == PGASYNC_IDLE)
+       {
+               /* If the connection is not active then set up */
+               if (!PQsendQueryParams(conn, last_fsstate->query,
last_fsstate->numParams,
+                                                          NULL,
values, NULL, NULL, 0))
+                       pgfdw_report_error(NULL, conn,
last_fsstate->query);
+
+               if (!PQsetChunkedRowsMode(conn, last_fsstate->fetch_size))
+                       pgfdw_report_error(NULL, conn,
last_fsstate->query);
+       }
I think this is a problem for multiple reasons. One problem is that
the function header says that we use this "when the result of a query
is not completely fetched". But if the status where PGASYNC_IDLE, the
query has not even started yet. In that case, seems there is no reason
to call this function in the first place. We can just start the new
scan right away and wait to do anything about the other scan until
it's needed. Additionally, asyncStatus is not part of libpq's exposed
API. You're only able to access it here because you've included
libpq-int.h into postgres_fdw.h, but libpq-int.h is so-called because
it's "internal", so you should really be trying very hard not to use
it, and especially not to include it in another header file. I suggest
that there's probably just a bug here -- I don't think you should need
to check asyncStatus here in the first place.
+ if (pgfdw_use_cursor)
+ snprintf(sql, sizeof(sql), "CLOSE c%u",
+ fsstate->cursor_number);
+ else
+ close_cursor = true;
This is extremely confusing coding. If we're using a cursor, we write
some SQL into a buffer but don't actually do anything with the buffer
right away. But if we're not using a cursor, then we set a flag that
tells us to close a cursor ... which we're not using? I think the
logic updates in the function are not at all clear. You need to back
up and think more carefully about what the function is supposed to do
in each case. Look at how the function starts after your patch:

postgresReScanForeignScan(ForeignScanState *node)
{
PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
char sql[64];
PGresult *res;
bool close_cursor = false;

/* If we haven't created the cursor yet, nothing to do. */
if (!fsstate->cursor_exists)
return;

Well, if we're in non-cursor mode, that presumably means this function
always returns without doing anything, since there should not be a
cursor if we are not using with them. That can't be right.
Alternatively, maybe it means that the cursor_exists flag is now being
used to track something other than whether a cursor exists, which
wouldn't be right either. I don't know off the top of my head exactly
what this function should do in each case, but as the patch author,
it's your job to go through, figure that out, and make sure that it is
all correct and well-commented.

Similarly, you have a bunch of changes to a function called
create_cursor(), but why would a function that creates a cursor get
changed to handle a mode where we don't create cursors? Either the
function needs to be renamed, or there should be two separate
functions, or these changes don't do anything in the first place.

+ while (res != NULL)
+ res = pgfdw_get_result(fsstate->conn);

I don't see how this code can be right. We should know how many result
sets we expect and read exactly that number, not just read over and
over until there's nothing more. But even if we did want to read until
there's nothing more, we would need to free those result sets rather
than leaking them.

-
- /* Clean up */
- pfree(buf.data);

It's hard to understand why this would be a good idea.

/*
* We'll store the tuples in the batch_cxt. First, flush the previous
* batch.
*/
fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);

It's also hard to understand why this would be a good idea. It seems
like it makes the comment false: we wouldn't be flushing the previous
batch any more. We'd just be forgetting that it existed and leaking
the memory.
+ /*
+ * In non-cursor mode, there are three options for the further
+ * processing: 1. If the tuplestore is already filled, retrieve the
+ * tuples from there. 2. Fetch the tuples till the end of the query
+ * and store them in tuplestore. 3. Perform a normal fetch and process
+ * the tuples.
I find this comment pretty confusing. First of all, it's not really
clear what performing a "normal fetch" means here. But also, this says
that in non-cursor mode there are three options, and that's followed
by an if-else block i.e. two options. Three isn't the same as two, so
something isn't right here.

The code that follows is quite involved. It's rather deeply nested in
places, so possibly some of it needs to be broken out into separate
functions. It contains another instance, of this, which has the same
problems I pointed out earlier:
+ /* There are no more tuples to fetch */
+ while (res != NULL)
+ res = pgfdw_get_result(conn);
If there are no more tuples to fetch, then why do we need to loop
fetching result sets until we get a NULL? And if we ever get any, we
leak them, which is also bad.
+ /*
+ * Reset cursor mode when in asynchronous mode as it is not supported in
+ * non-cursor mode
+ */
+ if (!pgfdw_use_cursor)
+ pgfdw_use_cursor = true;
It's not allowed to change a GUC variable from outside of the GUC
code. If you need to disable the optimization in certain cases, you
need a separate flag for that, which is initially set based on
pgfdw_use_cursor and then can be changed here or wherever is needed.
You can't change the GUC value itself. This would cause a variety of
problems, including the fact that the optimization would then stay
disabled for later queries in the session, but probably also some more
GUC-related stuff.

-
SELECT current_database() AS current_database,
current_setting('port') AS current_port
\gset
-

Please avoid including unnecessary cosmetic changes in the patch file.

Overall, I think the test case changes are in much better shape than
they were in earlier patch versions. I suspect there might still be a
bit of room for improvement, but we can deal with that once some of
these other issues are sorted out.

--
Robert Haas
EDB: http://www.enterprisedb.com

Thank you for the detailed review of the patch. Please find the attached
file for the updated patch.
To summarise the changes done in this patch following these review comments,
- the new GUC is called disable_cursor, so when we want to use this feature
we set it, otherwise it remains false by default. I hope this makes it
easier to understand as you mentioned.

- Simplified the code in fetch_more_data for this new mode. Now we have a
function called fetch_from_tuplestore, and it is called from
fetch_more_data when tuples from tuplestore are requested. Another function
called save_to_tuplestore is also introduced in this patch, it is called
from create_cursor code when all the tuples are to be fetched for the
active_scan. So things look better code readability wise.

- For handling the async mode, I went to one level up and modified
async_capable to false when no_cursor mode is used. So, now we do not
change any GUC as such.

- With regards to use of flag cursor_exists and functions create_cursor and
close_cursor, these are used in this no_cursor mode as well. For the flag
cursor_exists, I used it to check if we have already done the required
steps for when disale_cursor is set, like setting the ChunkedRowsMode,
saving the tuples to the tuplestore if required. Inherently there is no
nothing in fsstate or conn_state through which we can identify that all the
required steps are done or if this scan is coming for the first time, so
maybe we can have a different flag for the case when disable_cursor is set
and use it for such checks. But to me it feels like additional code when I
can use existing vars, but I agree on making it more readable that way. So,
I modified in this patch to add a separate flag called scan_init and
functions init_scan and end_scan for the setting and resetting of the
required values when cursors are not used. So, let me know what you think
is better using the existing flag or creating a new one for this mode as
done in this patch.

- For this change,
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
* batch.
*/
fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
since, the tuplestore is now created in batch_cxt, so we can't reset it
here. But I now changed it to be done only when disable_cursor is set.

- Regarding the following change,
+       if (conn->asyncStatus == PGASYNC_IDLE)
+       {
+               /* If the connection is not active then set up */
When we are here after the call to the rescan, the connection is as per the
current fsstate which has not yet started, hence the idle status, but we
are here to fetch the tuples for the active scan which got reset after the
rescan so it is required to set it up again. But you are right in that
using asyncStatus is not appropriate here, so I added a new flag to check
for rescan, which is set in rescan and used here to decide if we need to
set the ChunkedRowsMode.

- Modified and added more comments
--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH