Asynchronous execution on FDW
Hello. This is the new version of the FDW asynchronous execution feature.
As of the last commitfest, the status of this feature was as follows.
- Async execution is valuable to have.
- But giving the first kick in the ExecInit phase is wrong.
So the design outline of this version is as follows:
- The patch set consists of three parts. The first is the
infrastructure on the core side, the second is the code that enables
asynchronous execution in postgres_fdw, and the third is a set of
three alternative methods to adapt the fetch size, which makes
asynchronous execution more effective.
- When to give the first kick for async execution was a problem. It
cannot be done in the ExecInit phase, and the ExecProc phase does
not fit either. An extra phase such as ExecPreProc would be too
invasive, so I tried a "pre-exec callback".
Any node can register callbacks during its initialization, and the
registered callbacks are called in the executor just before the
ExecProc phase. The first patch adds the functions and structs to
enable this.
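As an illustration, here is a minimal sketch of how a node would use
the new API. The my_fdw_* names are hypothetical; the real user is
postgresPreExecCallback in the second patch.

static void
my_fdw_pre_exec(EState *estate, Node *node)
{
    ForeignScanState *fss = (ForeignScanState *) node;

    /* kick the remote work here, just before the ExecProc phase starts */
    my_fdw_start_remote_query(fss->fdw_state);  /* hypothetical helper */
}

static void
my_fdwBeginForeignScan(ForeignScanState *node, int eflags)
{
    EState *estate = node->ss.ps.state;

    /* ... usual BeginForeignScan setup ... */

    /* registered during ExecInitNode; InitPlan() calls it back later */
    RegisterPreExecCallback(my_fdw_pre_exec, estate, (Node *) node, NULL);
}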
- The second part is unchanged from the previous version. It adds
PgFdwConn as an extended PGconn with some members to support
asynchronous execution.
Asynchronous execution is kicked only for the first ForeignScan node
on a given foreign server, and that state lasts until the next scan
comes. This behavior is mainly controlled in fetch_more_data(), and
it limits the number of simultaneous executions on one foreign
server to one. This is because no reasonable method to limit the
multiplicity of execution on a *single peer* has been found so far.
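To make that ownership rule concrete, the decision made in
fetch_more_data() in the second patch boils down to roughly the
following (a condensed sketch, not the verbatim code; error handling
omitted, with conn, fsstate, sql and res as in that function):

if (!PFCisAsyncRunning(conn))
{
    /* nobody is fetching asynchronously: become the owner */
    PFCsendQuery(conn, sql);          /* returns without waiting */
    PFCsetAsyncScan(conn, fsstate);
}
else if (PFCgetAsyncScan(conn) == fsstate)
{
    /* our own FETCH is already in flight: just collect the result */
    res = PFCgetResult(conn);
    PFCsetAsyncScan(conn, NULL);
}
else
{
    /* another scan owns the async slot: let it finish, then go sync */
    finish_async_query(conn);
    res = PFCexec(conn, sql);
}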
- The third part consists of three experimental alternatives for an
adaptive fetch size feature.
The first one is duration-based adaptation. The patch increases the
fetch size with every FETCH execution but tries to keep the duration
of each FETCH below 500 ms. It is not promising, though, because it
looks very unstable and its behavior is nearly unforeseeable.
The second one is based on a byte-based FETCH feature. This patch
adds to the FETCH command an argument that limits the number of
bytes (octets) to send. But this might be an over-exposure of the
internals: the size is counted based on the internal representation
of a tuple, and the client has to send the overhead of its internal
tuple representation in bytes. This is effective but quite ugly.
The third one is the simplest and most straightforward way: add a
foreign table option to specify the fetch_size. Its effect is also
somewhat in doubt, since the size of the tuples for one foreign
table varies with the list of returned columns, but it is
foreseeable for users and a necessary knob for those who want to
tune it. The foreign server could also have the same option as a
default for its foreign tables, but this patch does not add that.
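To give an idea of how the first (duration-based) alternative
behaves, its core is the following adjustment made before each FETCH
(a condensed sketch; the constants and fields are those defined in
patch 0003a):

new_fetch_size = fsstate->fetch_size;

/* shrink if the previous batch needed more than MAX_FETCH_BUFFER_SIZE */
if (fsstate->last_buf_size > MAX_FETCH_BUFFER_SIZE)
    new_fetch_size = fsstate->fetch_size * MAX_FETCH_BUFFER_SIZE /
        fsstate->last_buf_size;

/* halve it if the previous async FETCH is still running after 500 ms */
if (PFCisBusy(conn) && fsstate->fetch_size > MIN_FETCH_SIZE &&
    fsstate->last_fetch_req_at + MAX_FETCH_DURATION < current_time)
    new_fetch_size = Min(new_fetch_size, fsstate->fetch_size / 2);

/* double it after eight successive asynchronous fetches at this size */
if (new_fetch_size == fsstate->fetch_size &&
    fsstate->successive_async >= INCREASE_FETCH_SIZE_THRESHOLD &&
    fsstate->fetch_size < MAX_FETCH_SIZE)
    new_fetch_size *= 2;

fsstate->fetch_size = Min(MAX_FETCH_SIZE,
                          Max(MIN_FETCH_SIZE, new_fetch_size));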
The attached patches are the following:
- 0001-Add-infrastructure-of-pre-execution-callbacks.patch
Infrastructure of pre-execution callback
- 0002-Allow-asynchronous-remote-query-of-postgres_fdw.patch
FDW asynchronous execution feature
- 0003a-Add-experimental-POC-adaptive-fetch-size-feature.patch
Adaptive fetch size alternative 1: duration based control
- 0003b-POC-Experimental-fetch_by_size-feature.patch
Adaptive fetch size alternative 2: FETCH by size
- 0003c-Add-foreign-table-option-to-set-fetch-size.patch
Adaptive fetch size alternative 3: Foreign table option.
regards,
Attachments:
0001-Add-infrastructure-of-pre-execution-callbacks.patch (text/x-patch)
From eb621897d1410079c6458bc4d1914d1345eb77bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 26 Jun 2015 15:12:16 +0900
Subject: [PATCH 1/3] Add infrastructure of pre-execution callbacks.
Some executor nodes need to do some work before plan tree execution.
This infrastructure provides that functionality.
---
src/backend/executor/execMain.c | 32 ++++++++++++++++++++++++++++++++
src/backend/executor/execUtils.c | 2 ++
src/include/nodes/execnodes.h | 22 ++++++++++++++++++++++
3 files changed, 56 insertions(+)
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index a1561ce..51a86b2 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -764,6 +764,35 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
PreventCommandIfParallelMode(CreateCommandTag((Node *) plannedstmt));
}
+/*
+ * Register callbacks to be called just before plan execution.
+ */
+void
+RegisterPreExecCallback(PreExecCallback callback, EState *es, Node *nd,
+ void *arg)
+{
+ PreExecCallbackItem *item;
+
+ item = (PreExecCallbackItem*)
+ MemoryContextAlloc(es->es_query_cxt, sizeof(PreExecCallbackItem));
+ item->callback = callback;
+ item->node = nd;
+ item->arg = arg;
+
+ /* add the new item at the head of the chain */
+ item->next = es->es_preExecCallbacks;
+ es->es_preExecCallbacks = item;
+}
+
+/* Execute registered pre-exec callbacks */
+void
+RunPreExecCallbacks(EState *es)
+{
+ PreExecCallbackItem *item;
+
+ for (item = es->es_preExecCallbacks ; item ; item = item->next)
+ item->callback(es, item->node);
+}
/* ----------------------------------------------------------------
* InitPlan
@@ -956,6 +985,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
*/
planstate = ExecInitNode(plan, estate, eflags);
+ /* Execute pre-execution callbacks registered during ExecInitNode */
+ RunPreExecCallbacks(estate);
+
/*
* Get the tuple descriptor describing the type of tuples to return.
*/
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 3c611b9..e80bc22 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -123,6 +123,8 @@ CreateExecutorState(void)
estate->es_rowMarks = NIL;
+ estate->es_preExecCallbacks = NULL;
+
estate->es_processed = 0;
estate->es_lastoid = InvalidOid;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 541ee18..cb8d854 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -343,6 +343,26 @@ typedef struct ResultRelInfo
List *ri_onConflictSetWhere;
} ResultRelInfo;
+struct EState;
+
+/* ----------------
+ * Pre-execute callbacks
+ * ----------------
+ */
+typedef void (*PreExecCallback) (struct EState *estate, Node *node);
+typedef struct PreExecCallbackItem
+{
+ struct PreExecCallbackItem *next;
+ PreExecCallback callback; /* function to call just before execution
+ * starts */
+ Node *node; /* node to process */
+ void *arg; /* any extra arguments */
+} PreExecCallbackItem;
+
+void RegisterPreExecCallback(PreExecCallback callback, struct EState *es,
+ Node *nd, void *arg);
+void RunPreExecCallbacks(struct EState *es);
+
/* ----------------
* EState information
*
@@ -387,6 +407,8 @@ typedef struct EState
List *es_rowMarks; /* List of ExecRowMarks */
+ PreExecCallbackItem *es_preExecCallbacks; /* pre-exec callbacks */
+
uint32 es_processed; /* # of tuples processed */
Oid es_lastoid; /* last oid processed (by INSERT) */
--
1.8.3.1
0002-Allow-asynchronous-remote-query-of-postgres_fdw.patch (text/x-patch)
From 8db5a4992cf0509b6c9f93d659a3ba3644f30fa9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 26 Jun 2015 16:54:39 +0900
Subject: [PATCH 2/3] Allow asynchronous remote query of postgres_fdw.
The new type PgFdwConn makes connection.c aware of a running
asynchronous query.
The first node on a connection is kicked prior to ExecProcNode. In
addition, fetch_more_data() keeps fetching asynchronously, using the
async-aware functions of PgFdwConn, as long as no other query starts
to run on the same connection.
---
contrib/postgres_fdw/Makefile | 2 +-
contrib/postgres_fdw/PgFdwConn.c | 200 ++++++++++++++++++++++++
contrib/postgres_fdw/PgFdwConn.h | 61 ++++++++
contrib/postgres_fdw/connection.c | 81 +++++-----
contrib/postgres_fdw/postgres_fdw.c | 294 +++++++++++++++++++++++++++---------
contrib/postgres_fdw/postgres_fdw.h | 15 +-
6 files changed, 538 insertions(+), 115 deletions(-)
create mode 100644 contrib/postgres_fdw/PgFdwConn.c
create mode 100644 contrib/postgres_fdw/PgFdwConn.h
diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index d2b98e1..d0913e2 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -1,7 +1,7 @@
# contrib/postgres_fdw/Makefile
MODULE_big = postgres_fdw
-OBJS = postgres_fdw.o option.o deparse.o connection.o $(WIN32RES)
+OBJS = postgres_fdw.o PgFdwConn.o option.o deparse.o connection.o $(WIN32RES)
PGFILEDESC = "postgres_fdw - foreign data wrapper for PostgreSQL"
PG_CPPFLAGS = -I$(libpq_srcdir)
diff --git a/contrib/postgres_fdw/PgFdwConn.c b/contrib/postgres_fdw/PgFdwConn.c
new file mode 100644
index 0000000..b13b597
--- /dev/null
+++ b/contrib/postgres_fdw/PgFdwConn.c
@@ -0,0 +1,200 @@
+/*-------------------------------------------------------------------------
+ *
+ * PgFdwConn.c
+ * PGconn extending wrapper to enable asynchronous query.
+ *
+ * Portions Copyright (c) 2012-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/postgres_fdw/PgFdwConn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "PgFdwConn.h"
+
+#define PFC_ALLOCATE() ((PgFdwConn *)malloc(sizeof(PgFdwConn)))
+#define PFC_FREE(c) free(c)
+
+struct pgfdw_conn
+{
+ PGconn *pgconn; /* libpq connection for this connection */
+ int nscans; /* number of scans using this connection */
+ struct PgFdwScanState *async_scan; /* the scan state currently running
+ * an async query on this connection */
+};
+
+void
+PFCsetAsyncScan(PgFdwConn *conn, struct PgFdwScanState *scan)
+{
+ conn->async_scan = scan;
+}
+
+struct PgFdwScanState *
+PFCgetAsyncScan(PgFdwConn *conn)
+{
+ return conn->async_scan;
+}
+
+int
+PFCisAsyncRunning(PgFdwConn *conn)
+{
+ return conn->async_scan != NULL;
+}
+
+PGconn *
+PFCgetPGconn(PgFdwConn *conn)
+{
+ return conn->pgconn;
+}
+
+int
+PFCgetNscans(PgFdwConn *conn)
+{
+ return conn->nscans;
+}
+
+int
+PFCincrementNscans(PgFdwConn *conn)
+{
+ return ++conn->nscans;
+}
+
+int
+PFCdecrementNscans(PgFdwConn *conn)
+{
+ Assert(conn->nscans > 0);
+ return --conn->nscans;
+}
+
+void
+PFCcancelAsync(PgFdwConn *conn)
+{
+ if (PFCisAsyncRunning(conn))
+ PFCconsumeInput(conn);
+}
+
+void
+PFCinit(PgFdwConn *conn)
+{
+ conn->async_scan = NULL;
+ conn->nscans = 0;
+}
+
+int
+PFCsendQuery(PgFdwConn *conn, const char *query)
+{
+ return PQsendQuery(conn->pgconn, query);
+}
+
+PGresult *
+PFCexec(PgFdwConn *conn, const char *query)
+{
+ return PQexec(conn->pgconn, query);
+}
+
+PGresult *
+PFCexecParams(PgFdwConn *conn,
+ const char *command,
+ int nParams,
+ const Oid *paramTypes,
+ const char *const * paramValues,
+ const int *paramLengths,
+ const int *paramFormats,
+ int resultFormat)
+{
+ return PQexecParams(conn->pgconn,
+ command, nParams, paramTypes, paramValues,
+ paramLengths, paramFormats, resultFormat);
+}
+
+PGresult *
+PFCprepare(PgFdwConn *conn,
+ const char *stmtName, const char *query,
+ int nParams, const Oid *paramTypes)
+{
+ return PQprepare(conn->pgconn, stmtName, query, nParams, paramTypes);
+}
+
+PGresult *
+PFCexecPrepared(PgFdwConn *conn,
+ const char *stmtName,
+ int nParams,
+ const char *const * paramValues,
+ const int *paramLengths,
+ const int *paramFormats,
+ int resultFormat)
+{
+ return PQexecPrepared(conn->pgconn,
+ stmtName, nParams, paramValues, paramLengths,
+ paramFormats, resultFormat);
+}
+
+PGresult *
+PFCgetResult(PgFdwConn *conn)
+{
+ return PQgetResult(conn->pgconn);
+}
+
+int
+PFCconsumeInput(PgFdwConn *conn)
+{
+ return PQconsumeInput(conn->pgconn);
+}
+
+int
+PFCisBusy(PgFdwConn *conn)
+{
+ return PQisBusy(conn->pgconn);
+}
+
+ConnStatusType
+PFCstatus(const PgFdwConn *conn)
+{
+ return PQstatus(conn->pgconn);
+}
+
+PGTransactionStatusType
+PFCtransactionStatus(const PgFdwConn *conn)
+{
+ return PQtransactionStatus(conn->pgconn);
+}
+
+int
+PFCserverVersion(const PgFdwConn *conn)
+{
+ return PQserverVersion(conn->pgconn);
+}
+
+char *
+PFCerrorMessage(const PgFdwConn *conn)
+{
+ return PQerrorMessage(conn->pgconn);
+}
+
+int
+PFCconnectionUsedPassword(const PgFdwConn *conn)
+{
+ return PQconnectionUsedPassword(conn->pgconn);
+}
+
+void
+PFCfinish(PgFdwConn *conn)
+{
+ PQfinish(conn->pgconn);
+ PFC_FREE(conn);
+}
+
+PgFdwConn *
+PFCconnectdbParams(const char *const * keywords,
+ const char *const * values, int expand_dbname)
+{
+ PgFdwConn *ret = PFC_ALLOCATE();
+
+ PFCinit(ret);
+ ret->pgconn = PQconnectdbParams(keywords, values, expand_dbname);
+
+ return ret;
+}
diff --git a/contrib/postgres_fdw/PgFdwConn.h b/contrib/postgres_fdw/PgFdwConn.h
new file mode 100644
index 0000000..f695f5a
--- /dev/null
+++ b/contrib/postgres_fdw/PgFdwConn.h
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * PgFdwConn.h
+ * PGconn extending wrapper to enable asynchronous query.
+ *
+ * Portions Copyright (c) 2012-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/postgres_fdw/PgFdwConn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PGFDWCONN_H
+#define PGFDWCONN_H
+
+#include "libpq-fe.h"
+
+typedef struct pgfdw_conn PgFdwConn;
+struct PgFdwScanState;
+
+extern void PFCsetAsyncScan(PgFdwConn *conn, struct PgFdwScanState *scan);
+extern struct PgFdwScanState *PFCgetAsyncScan(PgFdwConn *conn);
+extern int PFCisAsyncRunning(PgFdwConn *conn);
+extern PGconn *PFCgetPGconn(PgFdwConn *conn);
+extern int PFCgetNscans(PgFdwConn *conn);
+extern int PFCincrementNscans(PgFdwConn *conn);
+extern int PFCdecrementNscans(PgFdwConn *conn);
+extern void PFCcancelAsync(PgFdwConn *conn);
+extern void PFCinit(PgFdwConn *conn);
+extern int PFCsendQuery(PgFdwConn *conn, const char *query);
+extern PGresult *PFCexec(PgFdwConn *conn, const char *query);
+extern PGresult *PFCexecParams(PgFdwConn *conn,
+ const char *command,
+ int nParams,
+ const Oid *paramTypes,
+ const char *const * paramValues,
+ const int *paramLengths,
+ const int *paramFormats,
+ int resultFormat);
+extern PGresult *PFCprepare(PgFdwConn *conn,
+ const char *stmtName, const char *query,
+ int nParams, const Oid *paramTypes);
+extern PGresult *PFCexecPrepared(PgFdwConn *conn,
+ const char *stmtName,
+ int nParams,
+ const char *const * paramValues,
+ const int *paramLengths,
+ const int *paramFormats,
+ int resultFormat);
+extern PGresult *PFCgetResult(PgFdwConn *conn);
+extern int PFCconsumeInput(PgFdwConn *conn);
+extern int PFCisBusy(PgFdwConn *conn);
+extern ConnStatusType PFCstatus(const PgFdwConn *conn);
+extern PGTransactionStatusType PFCtransactionStatus(const PgFdwConn *conn);
+extern int PFCserverVersion(const PgFdwConn *conn);
+extern char *PFCerrorMessage(const PgFdwConn *conn);
+extern int PFCconnectionUsedPassword(const PgFdwConn *conn);
+extern void PFCfinish(PgFdwConn *conn);
+extern PgFdwConn *PFCconnectdbParams(const char *const * keywords,
+ const char *const * values, int expand_dbname);
+#endif /* PGFDWCONN_H */
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 1a1e5b5..790b675 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -44,7 +44,7 @@ typedef struct ConnCacheKey
typedef struct ConnCacheEntry
{
ConnCacheKey key; /* hash key (must be first) */
- PGconn *conn; /* connection to foreign server, or NULL */
+ PgFdwConn *conn; /* connection to foreign server, or NULL */
int xact_depth; /* 0 = no xact open, 1 = main xact open, 2 =
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
@@ -64,10 +64,10 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PgFdwConn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
-static void configure_remote_session(PGconn *conn);
-static void do_sql_command(PGconn *conn, const char *sql);
+static void configure_remote_session(PgFdwConn *conn);
+static void do_sql_command(PgFdwConn *conn, const char *sql);
static void begin_remote_xact(ConnCacheEntry *entry);
static void pgfdw_xact_callback(XactEvent event, void *arg);
static void pgfdw_subxact_callback(SubXactEvent event,
@@ -93,7 +93,7 @@ static void pgfdw_subxact_callback(SubXactEvent event,
* be useful and not mere pedantry. We could not flush any active connections
* mid-transaction anyway.
*/
-PGconn *
+PgFdwConn *
GetConnection(ForeignServer *server, UserMapping *user,
bool will_prep_stmt)
{
@@ -161,9 +161,11 @@ GetConnection(ForeignServer *server, UserMapping *user,
entry->have_error = false;
entry->conn = connect_pg_server(server, user);
elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\"",
- entry->conn, server->servername);
+ PFCgetPGconn(entry->conn), server->servername);
}
+ PFCincrementNscans(entry->conn);
+
/*
* Start a new transaction or subtransaction if needed.
*/
@@ -178,10 +180,10 @@ GetConnection(ForeignServer *server, UserMapping *user,
/*
* Connect to remote server using specified server and user mapping properties.
*/
-static PGconn *
+static PgFdwConn *
connect_pg_server(ForeignServer *server, UserMapping *user)
{
- PGconn *volatile conn = NULL;
+ PgFdwConn *volatile conn = NULL;
/*
* Use PG_TRY block to ensure closing connection on error.
@@ -223,14 +225,14 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
/* verify connection parameters and make connection */
check_conn_params(keywords, values);
- conn = PQconnectdbParams(keywords, values, false);
- if (!conn || PQstatus(conn) != CONNECTION_OK)
+ conn = PFCconnectdbParams(keywords, values, false);
+ if (!conn || PFCstatus(conn) != CONNECTION_OK)
{
char *connmessage;
int msglen;
/* libpq typically appends a newline, strip that */
- connmessage = pstrdup(PQerrorMessage(conn));
+ connmessage = pstrdup(PFCerrorMessage(conn));
msglen = strlen(connmessage);
if (msglen > 0 && connmessage[msglen - 1] == '\n')
connmessage[msglen - 1] = '\0';
@@ -246,7 +248,7 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
* otherwise, he's piggybacking on the postgres server's user
* identity. See also dblink_security_check() in contrib/dblink.
*/
- if (!superuser() && !PQconnectionUsedPassword(conn))
+ if (!superuser() && !PFCconnectionUsedPassword(conn))
ereport(ERROR,
(errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
errmsg("password is required"),
@@ -263,7 +265,7 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
{
/* Release PGconn data structure if we managed to create one */
if (conn)
- PQfinish(conn);
+ PFCfinish(conn);
PG_RE_THROW();
}
PG_END_TRY();
@@ -312,9 +314,9 @@ check_conn_params(const char **keywords, const char **values)
* there are any number of ways to break things.
*/
static void
-configure_remote_session(PGconn *conn)
+configure_remote_session(PgFdwConn *conn)
{
- int remoteversion = PQserverVersion(conn);
+ int remoteversion = PFCserverVersion(conn);
/* Force the search path to contain only pg_catalog (see deparse.c) */
do_sql_command(conn, "SET search_path = pg_catalog");
@@ -348,11 +350,11 @@ configure_remote_session(PGconn *conn)
* Convenience subroutine to issue a non-data-returning SQL command to remote
*/
static void
-do_sql_command(PGconn *conn, const char *sql)
+do_sql_command(PgFdwConn *conn, const char *sql)
{
PGresult *res;
- res = PQexec(conn, sql);
+ res = PFCexec(conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, true, sql);
PQclear(res);
@@ -379,7 +381,7 @@ begin_remote_xact(ConnCacheEntry *entry)
const char *sql;
elog(DEBUG3, "starting remote transaction on connection %p",
- entry->conn);
+ PFCgetPGconn(entry->conn));
if (IsolationIsSerializable())
sql = "START TRANSACTION ISOLATION LEVEL SERIALIZABLE";
@@ -408,13 +410,11 @@ begin_remote_xact(ConnCacheEntry *entry)
* Release connection reference count created by calling GetConnection.
*/
void
-ReleaseConnection(PGconn *conn)
+ReleaseConnection(PgFdwConn *conn)
{
- /*
- * Currently, we don't actually track connection references because all
- * cleanup is managed on a transaction or subtransaction basis instead. So
- * there's nothing to do here.
- */
+ /* ongoing async query should be canceled if no scans left */
+ if (PFCdecrementNscans(conn) == 0)
+ finish_async_query(conn);
}
/*
@@ -429,7 +429,7 @@ ReleaseConnection(PGconn *conn)
* collisions are highly improbable; just be sure to use %u not %d to print.
*/
unsigned int
-GetCursorNumber(PGconn *conn)
+GetCursorNumber(PgFdwConn *conn)
{
return ++cursor_number;
}
@@ -443,7 +443,7 @@ GetCursorNumber(PGconn *conn)
* increasing the risk of prepared-statement name collisions by resetting.
*/
unsigned int
-GetPrepStmtNumber(PGconn *conn)
+GetPrepStmtNumber(PgFdwConn *conn)
{
return ++prep_stmt_number;
}
@@ -462,7 +462,7 @@ GetPrepStmtNumber(PGconn *conn)
* marked with have_error = true.
*/
void
-pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
+pgfdw_report_error(int elevel, PGresult *res, PgFdwConn *conn,
bool clear, const char *sql)
{
/* If requested, PGresult must be released before leaving this function. */
@@ -490,7 +490,7 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
* return NULL, not a PGresult at all.
*/
if (message_primary == NULL)
- message_primary = PQerrorMessage(conn);
+ message_primary = PFCerrorMessage(conn);
ereport(elevel,
(errcode(sqlstate),
@@ -542,7 +542,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
if (entry->xact_depth > 0)
{
elog(DEBUG3, "closing remote transaction on connection %p",
- entry->conn);
+ PFCgetPGconn(entry->conn));
switch (event)
{
@@ -568,7 +568,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
*/
if (entry->have_prep_stmt && entry->have_error)
{
- res = PQexec(entry->conn, "DEALLOCATE ALL");
+ res = PFCexec(entry->conn, "DEALLOCATE ALL");
PQclear(res);
}
entry->have_prep_stmt = false;
@@ -600,7 +600,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
/* Assume we might have lost track of prepared statements */
entry->have_error = true;
/* If we're aborting, abort all remote transactions too */
- res = PQexec(entry->conn, "ABORT TRANSACTION");
+ res = PFCexec(entry->conn, "ABORT TRANSACTION");
/* Note: can't throw ERROR, it would be infinite loop */
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(WARNING, res, entry->conn, true,
@@ -611,7 +611,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
/* As above, make sure to clear any prepared stmts */
if (entry->have_prep_stmt && entry->have_error)
{
- res = PQexec(entry->conn, "DEALLOCATE ALL");
+ res = PFCexec(entry->conn, "DEALLOCATE ALL");
PQclear(res);
}
entry->have_prep_stmt = false;
@@ -623,17 +623,19 @@ pgfdw_xact_callback(XactEvent event, void *arg)
/* Reset state to show we're out of a transaction */
entry->xact_depth = 0;
+ PFCcancelAsync(entry->conn);
+ PFCinit(entry->conn);
/*
* If the connection isn't in a good idle state, discard it to
* recover. Next GetConnection will open a new connection.
*/
- if (PQstatus(entry->conn) != CONNECTION_OK ||
- PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+ if (PFCstatus(entry->conn) != CONNECTION_OK ||
+ PFCtransactionStatus(entry->conn) != PQTRANS_IDLE)
{
- elog(DEBUG3, "discarding connection %p", entry->conn);
- PQfinish(entry->conn);
- entry->conn = NULL;
+ elog(DEBUG3, "discarding connection %p",
+ PFCgetPGconn(entry->conn));
+ PFCfinish(entry->conn);
}
}
@@ -679,6 +681,9 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
PGresult *res;
char sql[100];
+ /* Shut down asynchronous scan if running */
+ PFCcancelAsync(entry->conn);
+
/*
* We only care about connections with open remote subtransactions of
* the current level.
@@ -704,7 +709,7 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
snprintf(sql, sizeof(sql),
"ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d",
curlevel, curlevel);
- res = PQexec(entry->conn, sql);
+ res = PFCexec(entry->conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(WARNING, res, entry->conn, true, sql);
else
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 6da01e1..40cac3b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -22,6 +22,7 @@
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/cost.h"
@@ -124,6 +125,13 @@ enum FdwModifyPrivateIndex
FdwModifyPrivateRetrievedAttrs
};
+typedef enum fetch_mode {
+ START_ONLY,
+ FORCE_SYNC,
+ ALLOW_ASYNC,
+ EXIT_ASYNC
+} fetch_mode;
+
/*
* Execution state of a foreign scan using postgres_fdw.
*/
@@ -137,7 +145,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ PgFdwConn *conn; /* connection for the scan */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -157,6 +165,7 @@ typedef struct PgFdwScanState
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
+ ExprContext *econtext; /* copy of ps_ExprContext of ForeignScanState */
} PgFdwScanState;
/*
@@ -168,7 +177,7 @@ typedef struct PgFdwModifyState
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ PgFdwConn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -299,7 +308,7 @@ static void estimate_path_cost_size(PlannerInfo *root,
double *p_rows, int *p_width,
Cost *p_startup_cost, Cost *p_total_cost);
static void get_remote_estimate(const char *sql,
- PGconn *conn,
+ PgFdwConn *conn,
double *rows,
int *width,
Cost *startup_cost,
@@ -307,9 +316,9 @@ static void get_remote_estimate(const char *sql,
static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
-static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
-static void close_cursor(PGconn *conn, unsigned int cursor_number);
+static void create_cursor(PgFdwScanState *fsstate);
+static void fetch_more_data(PgFdwScanState *node, fetch_mode cmd);
+static void close_cursor(PgFdwConn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
ItemPointer tupleid,
@@ -877,6 +886,21 @@ postgresGetForeignPlan(PlannerInfo *root,
NIL /* no custom tlist */ );
}
+/* call back function to kick the query to start on remote */
+static void
+postgresPreExecCallback(EState *estate, Node *node)
+{
+ PgFdwScanState *fsstate =
+ (PgFdwScanState *)((ForeignScanState *)node)->fdw_state;
+
+ create_cursor(fsstate);
+ /*
+ * Start async scan if this is the first scan. See fetch_more_data() for
+ * details
+ */
+ fetch_more_data(fsstate, START_ONLY);
+}
+
/*
* postgresBeginForeignScan
* Initiate an executor scan of a foreign PostgreSQL table.
@@ -988,6 +1012,16 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
fsstate->param_values = (const char **) palloc0(numParams * sizeof(char *));
else
fsstate->param_values = NULL;
+
+ fsstate->econtext = node->ss.ps.ps_ExprContext;
+
+ /*
+ * Register this node to be asynchronously executed if this is the first
+ * scan on this connection
+ */
+ if (PFCgetNscans(fsstate->conn) == 1)
+ RegisterPreExecCallback(postgresPreExecCallback, estate,
+ (Node*)node, NULL);
}
/*
@@ -1006,7 +1040,10 @@ postgresIterateForeignScan(ForeignScanState *node)
* cursor on the remote side.
*/
if (!fsstate->cursor_exists)
- create_cursor(node);
+ {
+ finish_async_query(fsstate->conn);
+ create_cursor(fsstate);
+ }
/*
* Get some more tuples, if we've run out.
@@ -1015,7 +1052,7 @@ postgresIterateForeignScan(ForeignScanState *node)
{
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
- fetch_more_data(node);
+ fetch_more_data(fsstate, ALLOW_ASYNC);
/* If we didn't get any tuples, must be end of data. */
if (fsstate->next_tuple >= fsstate->num_tuples)
return ExecClearTuple(slot);
@@ -1075,7 +1112,7 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexec(fsstate->conn, sql);
+ res = PFCexec(fsstate->conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
PQclear(res);
@@ -1411,19 +1448,22 @@ postgresExecForeignInsert(EState *estate,
/* Convert parameters needed by prepared statement to text form */
p_values = convert_prep_stmt_params(fmstate, NULL, slot);
+ /* Finish async query if running */
+ finish_async_query(fmstate->conn);
+
/*
* Execute the prepared statement, and check for success.
*
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexecPrepared(fmstate->conn,
- fmstate->p_name,
- fmstate->p_nums,
- p_values,
- NULL,
- NULL,
- 0);
+ res = PFCexecPrepared(fmstate->conn,
+ fmstate->p_name,
+ fmstate->p_nums,
+ p_values,
+ NULL,
+ NULL,
+ 0);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
@@ -1481,19 +1521,22 @@ postgresExecForeignUpdate(EState *estate,
(ItemPointer) DatumGetPointer(datum),
slot);
+ /* Finish async query if running */
+ finish_async_query(fmstate->conn);
+
/*
* Execute the prepared statement, and check for success.
*
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexecPrepared(fmstate->conn,
- fmstate->p_name,
- fmstate->p_nums,
- p_values,
- NULL,
- NULL,
- 0);
+ res = PFCexecPrepared(fmstate->conn,
+ fmstate->p_name,
+ fmstate->p_nums,
+ p_values,
+ NULL,
+ NULL,
+ 0);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
@@ -1551,19 +1594,22 @@ postgresExecForeignDelete(EState *estate,
(ItemPointer) DatumGetPointer(datum),
NULL);
+ /* Finish async query if running */
+ finish_async_query(fmstate->conn);
+
/*
* Execute the prepared statement, and check for success.
*
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexecPrepared(fmstate->conn,
- fmstate->p_name,
- fmstate->p_nums,
- p_values,
- NULL,
- NULL,
- 0);
+ res = PFCexecPrepared(fmstate->conn,
+ fmstate->p_name,
+ fmstate->p_nums,
+ p_values,
+ NULL,
+ NULL,
+ 0);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
@@ -1613,7 +1659,7 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexec(fmstate->conn, sql);
+ res = PFCexec(fmstate->conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
PQclear(res);
@@ -1745,7 +1791,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_join_conds;
StringInfoData sql;
List *retrieved_attrs;
- PGconn *conn;
+ PgFdwConn *conn;
Selectivity local_sel;
QualCost local_cost;
@@ -1855,7 +1901,7 @@ estimate_path_cost_size(PlannerInfo *root,
* The given "sql" must be an EXPLAIN command.
*/
static void
-get_remote_estimate(const char *sql, PGconn *conn,
+get_remote_estimate(const char *sql, PgFdwConn *conn,
double *rows, int *width,
Cost *startup_cost, Cost *total_cost)
{
@@ -1871,7 +1917,7 @@ get_remote_estimate(const char *sql, PGconn *conn,
/*
* Execute EXPLAIN remotely.
*/
- res = PQexec(conn, sql);
+ res = PFCexec(conn, sql);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql);
@@ -1936,13 +1982,12 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
* Create cursor for node's query with current parameter values.
*/
static void
-create_cursor(ForeignScanState *node)
+create_cursor(PgFdwScanState *fsstate)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
- ExprContext *econtext = node->ss.ps.ps_ExprContext;
+ ExprContext *econtext = fsstate->econtext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PgFdwConn *conn = fsstate->conn;
StringInfoData buf;
PGresult *res;
@@ -2004,8 +2049,8 @@ create_cursor(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexecParams(conn, buf.data, numParams, NULL, values,
- NULL, NULL, 0);
+ res = PFCexecParams(conn, buf.data, numParams, NULL, values,
+ NULL, NULL, 0);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, true, fsstate->query);
PQclear(res);
@@ -2026,54 +2071,121 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
PGresult *volatile res = NULL;
MemoryContext oldcontext;
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch for the case other than exiting from async mode.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (cmd != EXIT_ASYNC)
+ {
+ fsstate->tuples = NULL;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PgFdwConn *conn = fsstate->conn;
char sql[64];
int fetch_size;
- int numrows;
+ int numrows, addrows, restrows;
+ HeapTuple *tmptuples;
int i;
+ int fetch_buf_size;
/* The fetch size is arbitrary, but shouldn't be enormous. */
fetch_size = 100;
+ /* Make the query to fetch tuples */
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fetch_size, fsstate->cursor_number);
- res = PQexec(conn, sql);
+ if (PFCisAsyncRunning(conn))
+ {
+ Assert (cmd != START_ONLY);
+
+ /*
+ * If the target fsstate is different from the scan state that the
+ * current async fetch is running for, store the result into that
+ * state, then synchronously fetch data for the target fsstate.
+ */
+ if (fsstate != PFCgetAsyncScan(conn))
+ {
+ fetch_more_data(PFCgetAsyncScan(conn), EXIT_ASYNC);
+ res = PFCexec(conn, sql);
+ }
+ else
+ {
+ /* Get result of running async fetch */
+ res = PFCgetResult(conn);
+ if (PQntuples(res) == fetch_size)
+ {
+ /*
+ * Connection state doesn't go to IDLE even if all data
+ * has been sent to client for asynchronous query. One
+ * more PQgetResult() is needed to reset the state to
+ * IDLE. See PQexecFinish() for details.
+ */
+ if (PFCgetResult(conn) != NULL)
+ elog(ERROR, "Connection status error.");
+ }
+ }
+ PFCsetAsyncScan(conn, NULL);
+ }
+ else
+ {
+ if (cmd == START_ONLY)
+ {
+ if (!PFCsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, res, conn, false,
+ fsstate->query);
+
+ PFCsetAsyncScan(conn, fsstate);
+ goto end_of_fetch;
+ }
+
+ /* Otherwise do synchronous query execution */
+ PFCsetAsyncScan(conn, NULL);
+ res = PFCexec(conn, sql);
+ }
+
/* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ if (res && PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
- /* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
+ /* allocate tuple storage */
+ tmptuples = fsstate->tuples;
+ addrows = PQntuples(res);
+ restrows = fsstate->num_tuples - fsstate->next_tuple;
+ numrows = restrows + addrows;
+ fetch_buf_size = numrows * sizeof(HeapTuple);
+ fsstate->tuples = (HeapTuple *) palloc0(fetch_buf_size);
+
+ Assert(restrows == 0 || tmptuples);
+
+ /* copy unread tuples if any */
+ for (i = 0 ; i < restrows ; i++)
+ fsstate->tuples[i] = tmptuples[fsstate->next_tuple + i];
+
fsstate->num_tuples = numrows;
fsstate->next_tuple = 0;
- for (i = 0; i < numrows; i++)
+ /* Convert the data into HeapTuples */
+ for (i = 0 ; i < addrows; i++)
{
- fsstate->tuples[i] =
+ HeapTuple tup =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
fsstate->retrieved_attrs,
fsstate->temp_cxt);
+ fsstate->tuples[restrows + i] = tup;
+ fetch_buf_size += (HEAPTUPLESIZE + tup->t_len);
}
/* Update fetch_ct_2 */
@@ -2085,6 +2197,23 @@ fetch_more_data(ForeignScanState *node)
PQclear(res);
res = NULL;
+
+ if (cmd == ALLOW_ASYNC)
+ {
+ if (!fsstate->eof_reached)
+ {
+ /*
+ * We can immediately request the next bunch of tuples if
+ * we're on an asynchronous connection.
+ */
+ if (!PFCsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ PFCsetAsyncScan(conn, fsstate);
+ }
+ }
+
+end_of_fetch:
+ ; /* Nothing to do here but needed to make compiler quiet. */
}
PG_CATCH();
{
@@ -2098,6 +2227,28 @@ fetch_more_data(ForeignScanState *node)
}
/*
+ * Force the running asynchronous fetch, if any, to finish.
+ */
+void
+finish_async_query(PgFdwConn *conn)
+{
+ PgFdwScanState *fsstate = PFCgetAsyncScan(conn);
+ PgFdwConn *async_conn;
+
+ /* Nothing to do if no async connection */
+ if (fsstate == NULL) return;
+ async_conn = fsstate->conn;
+ if (!async_conn ||
+ PFCgetNscans(async_conn) == 1 ||
+ !PFCisAsyncRunning(async_conn))
+ return;
+
+ fetch_more_data(PFCgetAsyncScan(async_conn), EXIT_ASYNC);
+
+ Assert(!PFCisAsyncRunning(async_conn));
+}
+
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -2151,7 +2302,7 @@ reset_transmission_modes(int nestlevel)
* Utility routine to close a cursor.
*/
static void
-close_cursor(PGconn *conn, unsigned int cursor_number)
+close_cursor(PgFdwConn *conn, unsigned int cursor_number)
{
char sql[64];
PGresult *res;
@@ -2162,7 +2313,7 @@ close_cursor(PGconn *conn, unsigned int cursor_number)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexec(conn, sql);
+ res = PFCexec(conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, true, sql);
PQclear(res);
@@ -2184,6 +2335,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
GetPrepStmtNumber(fmstate->conn));
p_name = pstrdup(prep_name);
+ /* Finish async query if running */
+ finish_async_query(fmstate->conn);
+
/*
* We intentionally do not specify parameter types here, but leave the
* remote server to derive them by default. This avoids possible problems
@@ -2194,11 +2348,11 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQprepare(fmstate->conn,
- p_name,
- fmstate->query,
- 0,
- NULL);
+ res = PFCprepare(fmstate->conn,
+ p_name,
+ fmstate->query,
+ 0,
+ NULL);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
@@ -2316,7 +2470,7 @@ postgresAnalyzeForeignTable(Relation relation,
ForeignTable *table;
ForeignServer *server;
UserMapping *user;
- PGconn *conn;
+ PgFdwConn *conn;
StringInfoData sql;
PGresult *volatile res = NULL;
@@ -2348,7 +2502,7 @@ postgresAnalyzeForeignTable(Relation relation,
/* In what follows, do not risk leaking any PGresults. */
PG_TRY();
{
- res = PQexec(conn, sql.data);
+ res = PFCexec(conn, sql.data);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -2398,7 +2552,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
ForeignTable *table;
ForeignServer *server;
UserMapping *user;
- PGconn *conn;
+ PgFdwConn *conn;
unsigned int cursor_number;
StringInfoData sql;
PGresult *volatile res = NULL;
@@ -2442,7 +2596,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
/* In what follows, do not risk leaking any PGresults. */
PG_TRY();
{
- res = PQexec(conn, sql.data);
+ res = PFCexec(conn, sql.data);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
PQclear(res);
@@ -2472,7 +2626,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
snprintf(fetch_sql, sizeof(fetch_sql), "FETCH %d FROM c%u",
fetch_size, cursor_number);
- res = PQexec(conn, fetch_sql);
+ res = PFCexec(conn, fetch_sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -2600,7 +2754,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
bool import_not_null = true;
ForeignServer *server;
UserMapping *mapping;
- PGconn *conn;
+ PgFdwConn *conn;
StringInfoData buf;
PGresult *volatile res = NULL;
int numrows,
@@ -2633,7 +2787,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
conn = GetConnection(server, mapping, false);
/* Don't attempt to import collation if remote server hasn't got it */
- if (PQserverVersion(conn) < 90100)
+ if (PFCserverVersion(conn) < 90100)
import_collate = false;
/* Create workspace for strings */
@@ -2646,7 +2800,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, "SELECT 1 FROM pg_catalog.pg_namespace WHERE nspname = ");
deparseStringLiteral(&buf, stmt->remote_schema);
- res = PQexec(conn, buf.data);
+ res = PFCexec(conn, buf.data);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -2741,7 +2895,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfo(&buf, " ORDER BY c.relname, a.attnum");
/* Fetch the data */
- res = PQexec(conn, buf.data);
+ res = PFCexec(conn, buf.data);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 3835ddb..c87e5cf 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -18,19 +18,22 @@
#include "nodes/relation.h"
#include "utils/relcache.h"
-#include "libpq-fe.h"
+#include "PgFdwConn.h"
+
+struct PgFdwScanState;
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
+extern void finish_async_query(PgFdwConn *conn);
/* in connection.c */
-extern PGconn *GetConnection(ForeignServer *server, UserMapping *user,
+extern PgFdwConn *GetConnection(ForeignServer *server, UserMapping *user,
bool will_prep_stmt);
-extern void ReleaseConnection(PGconn *conn);
-extern unsigned int GetCursorNumber(PGconn *conn);
-extern unsigned int GetPrepStmtNumber(PGconn *conn);
-extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
+extern void ReleaseConnection(PgFdwConn *conn);
+extern unsigned int GetCursorNumber(PgFdwConn *conn);
+extern unsigned int GetPrepStmtNumber(PgFdwConn *conn);
+extern void pgfdw_report_error(int elevel, PGresult *res, PgFdwConn *conn,
bool clear, const char *sql);
/* in option.c */
--
1.8.3.1
0003a-Add-experimental-POC-adaptive-fetch-size-feature.patch (text/x-patch)
From 307209588737de34573d39d2b2376ce6f689a0f6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 26 Jun 2015 17:31:26 +0900
Subject: [PATCH 3/3] Add experimental (POC) adaptive fetch size feature.
---
contrib/postgres_fdw/postgres_fdw.c | 114 ++++++++++++++++++++++++++++++++++--
1 file changed, 108 insertions(+), 6 deletions(-)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 40cac3b..108b4ba 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -48,6 +48,27 @@ PG_MODULE_MAGIC;
/* Default CPU cost to process 1 row (above and beyond cpu_tuple_cost). */
#define DEFAULT_FDW_TUPLE_COST 0.01
+/* Fetch size at startup. This might be better as a GUC parameter */
+#define MIN_FETCH_SIZE 100
+
+/* Maximum fetch size. This might be better as a GUC parameter */
+#define MAX_FETCH_SIZE 1000
+
+/*
+ * Maximum size for fetch buffer in kilobytes. Ditto.
+ *
+ * This should be far larger than sizeof(HeapTuple) * MAX_FETCH_SIZE. This is
+ * not a hard limit because we cannot know in advance the average row length
+ * returned.
+ */
+#define MAX_FETCH_BUFFER_SIZE 10000 /* 10MB */
+
+/* Maximum duration allowed for a single fetch, in milliseconds */
+#define MAX_FETCH_DURATION 500
+
+/* Number of successive async fetches to enlarge fetch_size */
+#define INCREASE_FETCH_SIZE_THRESHOLD 8
+
/*
* FDW-specific planner information kept in RelOptInfo.fdw_private for a
* foreign table. This information is collected by postgresGetForeignRelSize.
@@ -157,6 +178,12 @@ typedef struct PgFdwScanState
HeapTuple *tuples; /* array of currently-retrieved tuples */
int num_tuples; /* # of tuples in array */
int next_tuple; /* index of next one to return */
+ int fetch_size; /* rows to be fetched at once */
+ int successive_async; /* # of successive fetches at this
+ fetch_size */
+ long last_fetch_req_at; /* The time of the last fetch request, in
+ * milliseconds */
+ int last_buf_size; /* Buffer size required for the last fetch */
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
@@ -886,6 +913,7 @@ postgresGetForeignPlan(PlannerInfo *root,
NIL /* no custom tlist */ );
}
+
/* call back function to kick the query to start on remote */
static void
postgresPreExecCallback(EState *estate, Node *node)
@@ -1015,6 +1043,10 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
fsstate->econtext = node->ss.ps.ps_ExprContext;
+ fsstate->fetch_size = MIN_FETCH_SIZE;
+ fsstate->successive_async = 0;
+ fsstate->last_buf_size = 0;
+
/*
* Register this node to be asynchronously executed if this is the first
* scan on this connection
@@ -2092,18 +2124,72 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
{
PgFdwConn *conn = fsstate->conn;
char sql[64];
- int fetch_size;
int numrows, addrows, restrows;
HeapTuple *tmptuples;
+ int prev_fetch_size = fsstate->fetch_size;
+ int new_fetch_size = fsstate->fetch_size;
int i;
+ struct timeval tv = {0, 0};
+ long current_time;
int fetch_buf_size;
- /* The fetch size is arbitrary, but shouldn't be enormous. */
- fetch_size = 100;
+ gettimeofday(&tv, NULL);
+ current_time = tv.tv_sec * 1000 + tv.tv_usec / 1000;
+
+ /*
+ * Calculate adaptive fetch size
+ *
+ * Calculate fetch_size based on maximal allowed duration and buffer
+ * space. The fetch buffer size shouldn't be enormous so we try to
+ * keep it under MAX_FETCH_BUFFER_SIZE.
+ */
+
+ /* Decrease fetch_size if the buffer size required by the previous
+ * fetch exceeded MAX_FETCH_BUFFER_SIZE. */
+ if (fsstate->last_buf_size > MAX_FETCH_BUFFER_SIZE)
+ {
+ new_fetch_size =
+ (int)((double)fsstate->fetch_size * MAX_FETCH_BUFFER_SIZE /
+ fsstate->last_buf_size);
+ }
+ /*
+ * Halve fetch_size if the last fetch took longer than the maximum
+ * allowed duration.
+ */
+ if (PFCisBusy(conn) &&
+ fsstate->fetch_size > MIN_FETCH_SIZE &&
+ fsstate->last_fetch_req_at + MAX_FETCH_DURATION <
+ current_time)
+ {
+ int tmp_fetch_size = fsstate->fetch_size / 2;
+ if (tmp_fetch_size < new_fetch_size)
+ new_fetch_size = tmp_fetch_size;
+ }
+
+ /*
+ * Double fetch_size if it was not decreased above and the other
+ * conditions are met.
+ */
+ if (new_fetch_size == fsstate->fetch_size &&
+ fsstate->successive_async >= INCREASE_FETCH_SIZE_THRESHOLD &&
+ fsstate->fetch_size < MAX_FETCH_SIZE)
+ new_fetch_size *= 2;
+
+ /* Change fetch_size as calculated above */
+ if (new_fetch_size != fsstate->fetch_size)
+ {
+ if (new_fetch_size > MAX_FETCH_SIZE)
+ fsstate->fetch_size = MAX_FETCH_SIZE;
+ else if (new_fetch_size < MIN_FETCH_SIZE)
+ fsstate->fetch_size = MIN_FETCH_SIZE;
+ else
+ fsstate->fetch_size = new_fetch_size;
+ fsstate->successive_async = 0;
+ }
/* Make the query to fetch tuples */
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fetch_size, fsstate->cursor_number);
+ fsstate->fetch_size, fsstate->cursor_number);
if (PFCisAsyncRunning(conn))
{
@@ -2123,7 +2209,7 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
{
/* Get result of running async fetch */
res = PFCgetResult(conn);
- if (PQntuples(res) == fetch_size)
+ if (PQntuples(res) == prev_fetch_size)
{
/*
* Connection state doesn't go to IDLE even if all data
@@ -2144,6 +2230,7 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
if (!PFCsendQuery(conn, sql))
pgfdw_report_error(ERROR, res, conn, false,
fsstate->query);
+ fsstate->last_fetch_req_at = current_time;
PFCsetAsyncScan(conn, fsstate);
goto end_of_fetch;
@@ -2188,12 +2275,14 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
fetch_buf_size += (HEAPTUPLESIZE + tup->t_len);
}
+ fsstate->last_buf_size = fetch_buf_size / 1024; /* in kilobytes */
+
/* Update fetch_ct_2 */
if (fsstate->fetch_ct_2 < 2)
fsstate->fetch_ct_2++;
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fetch_size);
+ fsstate->eof_reached = (numrows < prev_fetch_size);
PQclear(res);
res = NULL;
@@ -2208,6 +2297,7 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
*/
if (!PFCsendQuery(conn, sql))
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ fsstate->last_fetch_req_at = current_time;
PFCsetAsyncScan(conn, fsstate);
}
}
@@ -2223,6 +2313,18 @@ end_of_fetch:
}
PG_END_TRY();
+ if (PFCisAsyncRunning(fsstate->conn))
+ {
+ if (fsstate->successive_async < INCREASE_FETCH_SIZE_THRESHOLD)
+ fsstate->successive_async++;
+ }
+ else
+ {
+ /* Reset fetch_size if the async_fetch stopped */
+ fsstate->successive_async = 0;
+ fsstate->fetch_size = MIN_FETCH_SIZE;
+ }
+
MemoryContextSwitchTo(oldcontext);
}
--
1.8.3.1
0003b-POC-Experimental-fetch_by_size-feature.patch (text/x-patch)
From e2cc7054bbab06c631d7c78491cb52143a4e47f9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 29 Jun 2015 16:51:12 +0900
Subject: [PATCH] POC: Experimental fetch_by_size feature
---
contrib/auto_explain/auto_explain.c | 8 +-
contrib/pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 92 +++++++++++++++-----
src/backend/access/common/heaptuple.c | 42 +++++++++
src/backend/commands/copy.c | 2 +-
src/backend/commands/createas.c | 2 +-
src/backend/commands/explain.c | 2 +-
src/backend/commands/extension.c | 2 +-
src/backend/commands/matview.c | 2 +-
src/backend/commands/portalcmds.c | 4 +-
src/backend/commands/prepare.c | 2 +-
src/backend/executor/execMain.c | 39 +++++++--
src/backend/executor/execUtils.c | 1 +
src/backend/executor/functions.c | 2 +-
src/backend/executor/spi.c | 4 +-
src/backend/parser/gram.y | 65 ++++++++++++++
src/backend/tcop/postgres.c | 2 +
src/backend/tcop/pquery.c | 109 +++++++++++++++++-------
src/include/access/htup_details.h | 2 +
src/include/executor/executor.h | 8 +-
src/include/nodes/execnodes.h | 1 +
src/include/nodes/parsenodes.h | 2 +
src/include/tcop/pquery.h | 7 +-
src/interfaces/ecpg/preproc/ecpg.addons | 83 ++++++++++++++++++
24 files changed, 409 insertions(+), 82 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 2a184ed..f121a33 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -57,7 +57,7 @@ void _PG_fini(void);
static void explain_ExecutorStart(QueryDesc *queryDesc, int eflags);
static void explain_ExecutorRun(QueryDesc *queryDesc,
ScanDirection direction,
- long count);
+ long count, long size);
static void explain_ExecutorFinish(QueryDesc *queryDesc);
static void explain_ExecutorEnd(QueryDesc *queryDesc);
@@ -232,15 +232,15 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
* ExecutorRun hook: all we need do is track nesting depth
*/
static void
-explain_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
+explain_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count, long size)
{
nesting_level++;
PG_TRY();
{
if (prev_ExecutorRun)
- prev_ExecutorRun(queryDesc, direction, count);
+ prev_ExecutorRun(queryDesc, direction, count, size);
else
- standard_ExecutorRun(queryDesc, direction, count);
+ standard_ExecutorRun(queryDesc, direction, count, size);
nesting_level--;
}
PG_CATCH();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 0eb991c..593d406 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -289,7 +289,7 @@ static void pgss_post_parse_analyze(ParseState *pstate, Query *query);
static void pgss_ExecutorStart(QueryDesc *queryDesc, int eflags);
static void pgss_ExecutorRun(QueryDesc *queryDesc,
ScanDirection direction,
- long count);
+ long count, long size);
static void pgss_ExecutorFinish(QueryDesc *queryDesc);
static void pgss_ExecutorEnd(QueryDesc *queryDesc);
static void pgss_ProcessUtility(Node *parsetree, const char *queryString,
@@ -870,15 +870,15 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
* ExecutorRun hook: all we need do is track nesting depth
*/
static void
-pgss_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
+pgss_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count, long size)
{
nested_level++;
PG_TRY();
{
if (prev_ExecutorRun)
- prev_ExecutorRun(queryDesc, direction, count);
+ prev_ExecutorRun(queryDesc, direction, count, size);
else
- standard_ExecutorRun(queryDesc, direction, count);
+ standard_ExecutorRun(queryDesc, direction, count, size);
nested_level--;
}
PG_CATCH();
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 40cac3b..0419cde 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -48,6 +48,11 @@ PG_MODULE_MAGIC;
/* Default CPU cost to process 1 row (above and beyond cpu_tuple_cost). */
#define DEFAULT_FDW_TUPLE_COST 0.01
+/* Maximum tuples per fetch */
+#define MAX_FETCH_SIZE 10000
+
+/* Maximum memory usable for retrieved data */
+#define MAX_FETCH_MEM (512 * 1024)
/*
* FDW-specific planner information kept in RelOptInfo.fdw_private for a
* foreign table. This information is collected by postgresGetForeignRelSize.
@@ -166,6 +171,8 @@ typedef struct PgFdwScanState
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
ExprContext *econtext; /* copy of ps_ExprContext of ForeignScanState */
+ long max_palloced_mem; /* For test, remove me later */
+ int max_numrows;
} PgFdwScanState;
/*
@@ -331,6 +338,8 @@ static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
double *totaldeadrows);
static void analyze_row_processor(PGresult *res, int row,
PgFdwAnalyzeState *astate);
+static Size estimate_tuple_overhead(TupleDesc tupDesc,
+ List *retrieved_attrs);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -1138,6 +1147,7 @@ postgresEndForeignScan(ForeignScanState *node)
if (fsstate == NULL)
return;
+ elog(LOG, "Max memory for tuple store = %ld, max numrows = %d", fsstate->max_palloced_mem, fsstate->max_numrows);
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
close_cursor(fsstate->conn, fsstate->cursor_number);
@@ -2092,18 +2102,20 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
{
PgFdwConn *conn = fsstate->conn;
char sql[64];
- int fetch_size;
+ int fetch_mem;
+ int tuple_overhead;
int numrows, addrows, restrows;
HeapTuple *tmptuples;
int i;
int fetch_buf_size;
- /* The fetch size is arbitrary, but shouldn't be enormous. */
- fetch_size = 100;
-
- /* Make the query to fetch tuples */
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fetch_size, fsstate->cursor_number);
+ tuple_overhead = estimate_tuple_overhead(fsstate->attinmeta->tupdesc,
+ fsstate->retrieved_attrs);
+ fetch_mem = MAX_FETCH_MEM - MAX_FETCH_SIZE * sizeof(HeapTuple);
+ snprintf(sql, sizeof(sql), "FETCH %d LIMIT %d (%d) FROM c%u",
+ MAX_FETCH_SIZE,
+ fetch_mem, tuple_overhead,
+ fsstate->cursor_number);
if (PFCisAsyncRunning(conn))
{
@@ -2123,17 +2135,15 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
{
/* Get result of running async fetch */
res = PFCgetResult(conn);
- if (PQntuples(res) == fetch_size)
- {
- /*
- * Connection state doesn't go to IDLE even if all data
- * has been sent to client for asynchronous query. One
- * more PQgetResult() is needed to reset the state to
- * IDLE. See PQexecFinish() for details.
- */
- if (PFCgetResult(conn) != NULL)
- elog(ERROR, "Connection status error.");
- }
+
+ /*
+ * Connection state doesn't go to IDLE even if all data
+ * has been sent to client for asynchronous query. One
+ * more PQgetResult() is needed to reset the state to
+ * IDLE. See PQexecFinish() for details.
+ */
+ if (PFCgetResult(conn) != NULL)
+ elog(ERROR, "Connection status error.");
}
PFCsetAsyncScan(conn, NULL);
}
@@ -2161,6 +2171,8 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
/* allocate tuple storage */
tmptuples = fsstate->tuples;
addrows = PQntuples(res);
+ if (fsstate->max_numrows < addrows)
+ fsstate->max_numrows = addrows;
restrows = fsstate->num_tuples - fsstate->next_tuple;
numrows = restrows + addrows;
fetch_buf_size = numrows * sizeof(HeapTuple);
@@ -2188,12 +2200,15 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
fetch_buf_size += (HEAPTUPLESIZE + tup->t_len);
}
+ if (fsstate->max_palloced_mem < fetch_buf_size)
+ fsstate->max_palloced_mem = fetch_buf_size;
+
/* Update fetch_ct_2 */
if (fsstate->fetch_ct_2 < 2)
fsstate->fetch_ct_2++;
- /* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fetch_size);
+ /* Must be EOF if we have no new tuple here. */
+ fsstate->eof_reached = (addrows == 0);
PQclear(res);
res = NULL;
@@ -3007,6 +3022,43 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
}
/*
+ * Compute the estimated overhead of the result tuples
+ * See heap_form_tuple for the details of this calculation.
+ */
+static Size
+estimate_tuple_overhead(TupleDesc tupDesc,
+ List *retrieved_attrs)
+{
+ Size size = 0;
+ int ncol = list_length(retrieved_attrs);
+ ListCell *lc;
+
+ size += offsetof(HeapTupleHeaderData, t_bits);
+ size += BITMAPLEN(ncol);
+
+ if (tupDesc->tdhasoid)
+ size += sizeof(Oid);
+
+ size = MAXALIGN(size);
+
+ size += sizeof(Datum) * ncol;
+ size += sizeof(bool) * ncol;
+
+ foreach (lc, retrieved_attrs)
+ {
+ int i = lfirst_int(lc);
+
+ if (i > 0)
+ {
+ if (tupDesc->attrs[i - 1]->attbyval)
+ size -= (sizeof(Datum) - tupDesc->attrs[i - 1]->attlen);
+ }
+ }
+
+ return size;
+}
+
+/*
* Create a tuple from the specified row of the PGresult.
*
* rel is the local representation of the foreign table, attinmeta is
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 09aea79..17525b5 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -133,6 +133,48 @@ heap_compute_data_size(TupleDesc tupleDesc,
return data_length;
}
+Size
+slot_compute_raw_data_size(TupleTableSlot *slot)
+{
+ TupleDesc tupleDesc = slot->tts_tupleDescriptor;
+ Datum *values = slot->tts_values;
+ bool *isnull = slot->tts_isnull;
+ Size data_length = 0;
+ int i;
+ int numberOfAttributes = tupleDesc->natts;
+ Form_pg_attribute *att = tupleDesc->attrs;
+
+ if (slot->tts_nvalid < tupleDesc->natts)
+ heap_deform_tuple(slot->tts_tuple, tupleDesc,
+ slot->tts_values, slot->tts_isnull);
+
+ for (i = 0; i < numberOfAttributes; i++)
+ {
+ Datum val;
+ Form_pg_attribute atti;
+
+ if (isnull[i])
+ continue;
+
+ val = values[i];
+ atti = att[i];
+
+ if (atti->attlen == -1)
+ {
+ data_length += toast_raw_datum_size(val);
+ }
+ else
+ {
+ data_length = att_align_datum(data_length, atti->attalign,
+ atti->attlen, val);
+ data_length = att_addlength_datum(data_length, atti->attlen,
+ val);
+ }
+ }
+
+ return data_length;
+}
+
/*
* heap_fill_tuple
* Load data portion of a tuple from values/isnull arrays
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8904676..463fc67 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -1928,7 +1928,7 @@ CopyTo(CopyState cstate)
else
{
/* run the plan --- the dest receiver will send tuples */
- ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0L);
+ ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0L, 0L, 0);
processed = ((DR_copy *) cstate->queryDesc->dest)->processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 41183f6..7612391 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -192,7 +192,7 @@ ExecCreateTableAs(CreateTableAsStmt *stmt, const char *queryString,
dir = ForwardScanDirection;
/* run the plan */
- ExecutorRun(queryDesc, dir, 0L);
+ ExecutorRun(queryDesc, dir, 0L, 0L, 0);
/* save the rowcount if we're given a completionTag to fill */
if (completionTag)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d1ecc2..4480343 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -498,7 +498,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
dir = ForwardScanDirection;
/* run the plan */
- ExecutorRun(queryDesc, dir, 0L);
+ ExecutorRun(queryDesc, dir, 0L, 0L, 0);
/* run cleanup too */
ExecutorFinish(queryDesc);
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 2b1dcd0..bc116f9 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -733,7 +733,7 @@ execute_sql_string(const char *sql, const char *filename)
dest, NULL, 0);
ExecutorStart(qdesc, 0);
- ExecutorRun(qdesc, ForwardScanDirection, 0);
+ ExecutorRun(qdesc, ForwardScanDirection, 0L, 0L, 0);
ExecutorFinish(qdesc);
ExecutorEnd(qdesc);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5492e59..39e29ba 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -363,7 +363,7 @@ refresh_matview_datafill(DestReceiver *dest, Query *query,
ExecutorStart(queryDesc, EXEC_FLAG_WITHOUT_OIDS);
/* run the plan */
- ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+ ExecutorRun(queryDesc, ForwardScanDirection, 0L, 0L, 0);
/* and clean up */
ExecutorFinish(queryDesc);
diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c
index 2794537..85fffc1 100644
--- a/src/backend/commands/portalcmds.c
+++ b/src/backend/commands/portalcmds.c
@@ -177,6 +177,8 @@ PerformPortalFetch(FetchStmt *stmt,
nprocessed = PortalRunFetch(portal,
stmt->direction,
stmt->howMany,
+ stmt->howLarge,
+ stmt->tupoverhead,
dest);
/* Return command status if wanted */
@@ -375,7 +377,7 @@ PersistHoldablePortal(Portal portal)
true);
/* Fetch the result set into the tuplestore */
- ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+ ExecutorRun(queryDesc, ForwardScanDirection, 0L, 0L, 0);
(*queryDesc->dest->rDestroy) (queryDesc->dest);
queryDesc->dest = NULL;
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index fb33d30..46fe4f8 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -291,7 +291,7 @@ ExecuteQuery(ExecuteStmt *stmt, IntoClause *intoClause,
*/
PortalStart(portal, paramLI, eflags, GetActiveSnapshot());
- (void) PortalRun(portal, count, false, dest, dest, completionTag);
+ (void) PortalRun(portal, count, 0L, 0, false, dest, dest, completionTag);
PortalDrop(portal, false);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 51a86b2..5f0de97 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -79,6 +79,8 @@ static void ExecutePlan(EState *estate, PlanState *planstate,
CmdType operation,
bool sendTuples,
long numberTuples,
+ long sizeTuples,
+ int tupleOverhead,
ScanDirection direction,
DestReceiver *dest);
static bool ExecCheckRTEPerms(RangeTblEntry *rte);
@@ -277,17 +279,20 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
*/
void
ExecutorRun(QueryDesc *queryDesc,
- ScanDirection direction, long count)
+ ScanDirection direction, long count, long size, int tupoverhead)
{
if (ExecutorRun_hook)
- (*ExecutorRun_hook) (queryDesc, direction, count);
+ (*ExecutorRun_hook) (queryDesc, direction,
+ count, size, tupoverhead);
else
- standard_ExecutorRun(queryDesc, direction, count);
+ standard_ExecutorRun(queryDesc, direction,
+ count, size, tupoverhead);
}
void
standard_ExecutorRun(QueryDesc *queryDesc,
- ScanDirection direction, long count)
+ ScanDirection direction,
+ long count, long size, int tupoverhead)
{
EState *estate;
CmdType operation;
@@ -339,6 +344,8 @@ standard_ExecutorRun(QueryDesc *queryDesc,
operation,
sendTuples,
count,
+ size,
+ tupoverhead,
direction,
dest);
@@ -1551,22 +1558,27 @@ ExecutePlan(EState *estate,
CmdType operation,
bool sendTuples,
long numberTuples,
+ long sizeTuples,
+ int tupleOverhead,
ScanDirection direction,
DestReceiver *dest)
{
TupleTableSlot *slot;
long current_tuple_count;
+ long sent_size;
/*
* initialize local variables
*/
current_tuple_count = 0;
-
+ sent_size = 0;
/*
* Set the direction.
*/
estate->es_direction = direction;
+ estate->es_stoppedbysize = false;
+
/*
* Loop until we've processed the proper number of tuples from the plan.
*/
@@ -1621,6 +1633,23 @@ ExecutePlan(EState *estate,
current_tuple_count++;
if (numberTuples && numberTuples == current_tuple_count)
break;
+
+ if (sizeTuples > 0)
+ {
+ /*
+ * Count the size of tuples we've sent
+ *
+ * This needs all attributes deformed, so it can be a bit slow in some cases.
+ */
+ sent_size += slot_compute_raw_data_size(slot) + tupleOverhead;
+
+ /* Quit when the size limit will be exceeded by this tuple */
+ if (sizeTuples < sent_size)
+ {
+ estate->es_stoppedbysize = true;
+ break;
+ }
+ }
}
}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index e80bc22..6b59c05 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -126,6 +126,7 @@ CreateExecutorState(void)
estate->es_preExecCallbacks = NULL;
estate->es_processed = 0;
+ estate->es_stoppedbysize = false;
estate->es_lastoid = InvalidOid;
estate->es_top_eflags = 0;
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index ce49c47..7ab2e67 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -853,7 +853,7 @@ postquel_getnext(execution_state *es, SQLFunctionCachePtr fcache)
/* Run regular commands to completion unless lazyEval */
long count = (es->lazyEval) ? 1L : 0L;
- ExecutorRun(es->qd, ForwardScanDirection, count);
+ ExecutorRun(es->qd, ForwardScanDirection, count, 0L, 0);
/*
* If we requested run to completion OR there was no tuple returned,
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index d544ad9..f29c3a8 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -2399,7 +2399,7 @@ _SPI_pquery(QueryDesc *queryDesc, bool fire_triggers, long tcount)
ExecutorStart(queryDesc, eflags);
- ExecutorRun(queryDesc, ForwardScanDirection, tcount);
+ ExecutorRun(queryDesc, ForwardScanDirection, tcount, 0L, 0);
_SPI_current->processed = queryDesc->estate->es_processed;
_SPI_current->lastoid = queryDesc->estate->es_lastoid;
@@ -2477,7 +2477,7 @@ _SPI_cursor_operation(Portal portal, FetchDirection direction, long count,
/* Run the cursor */
nfetched = PortalRunFetch(portal,
direction,
- count,
+ count, 0L, 0,
dest);
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index e0ff6f1..b7b061c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -538,6 +538,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> opt_existing_window_name
%type <boolean> opt_if_not_exists
+%type <ival> opt_overhead
+
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -6066,6 +6068,16 @@ fetch_args: cursor_name
n->howMany = $1;
$$ = (Node *)n;
}
+ | SignedIconst LIMIT Iconst opt_overhead opt_from_in cursor_name
+ {
+ FetchStmt *n = makeNode(FetchStmt);
+ n->portalname = $6;
+ n->direction = FETCH_FORWARD;
+ n->howMany = $1;
+ n->howLarge = $3;
+ n->tupoverhead = $4;
+ $$ = (Node *)n;
+ }
| ALL opt_from_in cursor_name
{
FetchStmt *n = makeNode(FetchStmt);
@@ -6074,6 +6086,16 @@ fetch_args: cursor_name
n->howMany = FETCH_ALL;
$$ = (Node *)n;
}
+ | ALL LIMIT Iconst opt_overhead opt_from_in cursor_name
+ {
+ FetchStmt *n = makeNode(FetchStmt);
+ n->portalname = $6;
+ n->direction = FETCH_FORWARD;
+ n->howMany = FETCH_ALL;
+ n->howLarge = $3;
+ n->tupoverhead = $4;
+ $$ = (Node *)n;
+ }
| FORWARD opt_from_in cursor_name
{
FetchStmt *n = makeNode(FetchStmt);
@@ -6090,6 +6112,16 @@ fetch_args: cursor_name
n->howMany = $2;
$$ = (Node *)n;
}
+ | FORWARD SignedIconst LIMIT Iconst opt_overhead opt_from_in cursor_name
+ {
+ FetchStmt *n = makeNode(FetchStmt);
+ n->portalname = $7;
+ n->direction = FETCH_FORWARD;
+ n->howMany = $2;
+ n->howLarge = $4;
+ n->tupoverhead = $5;
+ $$ = (Node *)n;
+ }
| FORWARD ALL opt_from_in cursor_name
{
FetchStmt *n = makeNode(FetchStmt);
@@ -6098,6 +6130,16 @@ fetch_args: cursor_name
n->howMany = FETCH_ALL;
$$ = (Node *)n;
}
+ | FORWARD ALL LIMIT Iconst opt_overhead opt_from_in cursor_name
+ {
+ FetchStmt *n = makeNode(FetchStmt);
+ n->portalname = $7;
+ n->direction = FETCH_FORWARD;
+ n->howMany = FETCH_ALL;
+ n->howLarge = $4;
+ n->tupoverhead = $5;
+ $$ = (Node *)n;
+ }
| BACKWARD opt_from_in cursor_name
{
FetchStmt *n = makeNode(FetchStmt);
@@ -6114,6 +6156,16 @@ fetch_args: cursor_name
n->howMany = $2;
$$ = (Node *)n;
}
+ | BACKWARD SignedIconst LIMIT Iconst opt_overhead opt_from_in cursor_name
+ {
+ FetchStmt *n = makeNode(FetchStmt);
+ n->portalname = $7;
+ n->direction = FETCH_BACKWARD;
+ n->howMany = $2;
+ n->howLarge = $4;
+ n->tupoverhead = $5;
+ $$ = (Node *)n;
+ }
| BACKWARD ALL opt_from_in cursor_name
{
FetchStmt *n = makeNode(FetchStmt);
@@ -6122,6 +6174,16 @@ fetch_args: cursor_name
n->howMany = FETCH_ALL;
$$ = (Node *)n;
}
+ | BACKWARD ALL LIMIT Iconst opt_overhead opt_from_in cursor_name
+ {
+ FetchStmt *n = makeNode(FetchStmt);
+ n->portalname = $7;
+ n->direction = FETCH_BACKWARD;
+ n->howMany = FETCH_ALL;
+ n->howLarge = $4;
+ n->tupoverhead = $5;
+ $$ = (Node *)n;
+ }
;
from_in: FROM {}
@@ -6132,6 +6194,9 @@ opt_from_in: from_in {}
| /* EMPTY */ {}
;
+opt_overhead: '(' Iconst ')' { $$ = $2;}
+ | /* EMPTY */ { $$ = 0; }
+ ;
/*****************************************************************************
*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ce4bdaf..70641eb 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1103,6 +1103,7 @@ exec_simple_query(const char *query_string)
*/
(void) PortalRun(portal,
FETCH_ALL,
+ 0L, 0,
isTopLevel,
receiver,
receiver,
@@ -1987,6 +1988,7 @@ exec_execute_message(const char *portal_name, long max_rows)
completed = PortalRun(portal,
max_rows,
+ 0L, 0,
true, /* always top level */
receiver,
receiver,
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..ce9541a 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -16,6 +16,7 @@
#include "postgres.h"
#include "access/xact.h"
+#include "access/htup_details.h"
#include "commands/prepare.h"
#include "executor/tstoreReceiver.h"
#include "miscadmin.h"
@@ -39,9 +40,11 @@ static void ProcessQuery(PlannedStmt *plan,
DestReceiver *dest,
char *completionTag);
static void FillPortalStore(Portal portal, bool isTopLevel);
-static uint32 RunFromStore(Portal portal, ScanDirection direction, long count,
+static uint32 RunFromStore(Portal portal, ScanDirection direction,
+ long count, long size, int tupoverhead, bool *stoppedbysize,
DestReceiver *dest);
-static long PortalRunSelect(Portal portal, bool forward, long count,
+static long PortalRunSelect(Portal portal, bool forward,
+ long count, long size, int tupoverhead,
DestReceiver *dest);
static void PortalRunUtility(Portal portal, Node *utilityStmt, bool isTopLevel,
DestReceiver *dest, char *completionTag);
@@ -51,6 +54,8 @@ static void PortalRunMulti(Portal portal, bool isTopLevel,
static long DoPortalRunFetch(Portal portal,
FetchDirection fdirection,
long count,
+ long size,
+ int tupoverhead,
DestReceiver *dest);
static void DoPortalRewind(Portal portal);
@@ -182,7 +187,7 @@ ProcessQuery(PlannedStmt *plan,
/*
* Run the plan to completion.
*/
- ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+ ExecutorRun(queryDesc, ForwardScanDirection, 0L, 0L, 0);
/*
* Build command completion status string, if caller wants one.
@@ -703,8 +708,8 @@ PortalSetResultFormat(Portal portal, int nFormats, int16 *formats)
* suspended due to exhaustion of the count parameter.
*/
bool
-PortalRun(Portal portal, long count, bool isTopLevel,
- DestReceiver *dest, DestReceiver *altdest,
+PortalRun(Portal portal, long count, long size, int tupoverhead,
+ bool isTopLevel, DestReceiver *dest, DestReceiver *altdest,
char *completionTag)
{
bool result;
@@ -787,7 +792,8 @@ PortalRun(Portal portal, long count, bool isTopLevel,
/*
* Now fetch desired portion of results.
*/
- nprocessed = PortalRunSelect(portal, true, count, dest);
+ nprocessed = PortalRunSelect(portal, true,
+ count, size, tupoverhead, dest);
/*
* If the portal result contains a command tag and the caller
@@ -892,11 +898,14 @@ static long
PortalRunSelect(Portal portal,
bool forward,
long count,
+ long size,
+ int tupoverhead,
DestReceiver *dest)
{
QueryDesc *queryDesc;
ScanDirection direction;
uint32 nprocessed;
+ bool stoppedbysize;
/*
* NB: queryDesc will be NULL if we are fetching from a held cursor or a
@@ -939,12 +948,15 @@ PortalRunSelect(Portal portal,
count = 0;
if (portal->holdStore)
- nprocessed = RunFromStore(portal, direction, count, dest);
+ nprocessed = RunFromStore(portal, direction,
+ count, size, tupoverhead,
+ &stoppedbysize, dest);
else
{
PushActiveSnapshot(queryDesc->snapshot);
- ExecutorRun(queryDesc, direction, count);
+ ExecutorRun(queryDesc, direction, count, size, tupoverhead);
nprocessed = queryDesc->estate->es_processed;
+ stoppedbysize = queryDesc->estate->es_stoppedbysize;
PopActiveSnapshot();
}
@@ -954,8 +966,9 @@ PortalRunSelect(Portal portal,
if (nprocessed > 0)
portal->atStart = false; /* OK to go backward now */
- if (count == 0 ||
- (unsigned long) nprocessed < (unsigned long) count)
+ if ((count == 0 ||
+ (unsigned long) nprocessed < (unsigned long) count) &&
+ !stoppedbysize)
portal->atEnd = true; /* we retrieved 'em all */
oldPos = portal->portalPos;
portal->portalPos += nprocessed;
@@ -982,12 +995,15 @@ PortalRunSelect(Portal portal,
count = 0;
if (portal->holdStore)
- nprocessed = RunFromStore(portal, direction, count, dest);
+ nprocessed = RunFromStore(portal, direction,
+ count, size, tupoverhead,
+ &stoppedbysize, dest);
else
{
PushActiveSnapshot(queryDesc->snapshot);
- ExecutorRun(queryDesc, direction, count);
+ ExecutorRun(queryDesc, direction, count, size, tupoverhead);
nprocessed = queryDesc->estate->es_processed;
+ stoppedbysize = queryDesc->estate->es_stoppedbysize;
PopActiveSnapshot();
}
@@ -998,8 +1014,9 @@ PortalRunSelect(Portal portal,
portal->atEnd = false; /* OK to go forward now */
portal->portalPos++; /* adjust for endpoint case */
}
- if (count == 0 ||
- (unsigned long) nprocessed < (unsigned long) count)
+ if ((count == 0 ||
+ (unsigned long) nprocessed < (unsigned long) count) &&
+ !stoppedbysize)
{
portal->atStart = true; /* we retrieved 'em all */
portal->portalPos = 0;
@@ -1088,11 +1105,15 @@ FillPortalStore(Portal portal, bool isTopLevel)
* out for memory leaks.
*/
static uint32
-RunFromStore(Portal portal, ScanDirection direction, long count,
- DestReceiver *dest)
+RunFromStore(Portal portal, ScanDirection direction,
+ long count, long size_limit, int tupoverhead,
+ bool *stoppedbysize, DestReceiver *dest)
{
long current_tuple_count = 0;
TupleTableSlot *slot;
+ long sent_size = 0;
+
+ *stoppedbysize = false;
slot = MakeSingleTupleTableSlot(portal->tupDesc);
@@ -1122,6 +1143,9 @@ RunFromStore(Portal portal, ScanDirection direction, long count,
break;
(*dest->receiveSlot) (slot, dest);
+ /* Count the size of tuples we've sent */
+ sent_size += slot_compute_raw_data_size(slot)
+ + tupoverhead;
ExecClearTuple(slot);
@@ -1133,10 +1157,19 @@ RunFromStore(Portal portal, ScanDirection direction, long count,
current_tuple_count++;
if (count && count == current_tuple_count)
break;
+
+ /* Quit when the size limit will be exceeded by this tuple */
+ if (current_tuple_count > 0 &&
+ size_limit > 0 && size_limit < sent_size)
+ {
+ *stoppedbysize = true;
+ break;
+ }
}
}
(*dest->rShutdown) (dest);
+ elog(LOG, "Sent %ld bytes", sent_size);
ExecDropSingleTupleTableSlot(slot);
@@ -1385,6 +1418,8 @@ long
PortalRunFetch(Portal portal,
FetchDirection fdirection,
long count,
+ long size,
+ int tupoverhead,
DestReceiver *dest)
{
long result;
@@ -1422,7 +1457,8 @@ PortalRunFetch(Portal portal,
switch (portal->strategy)
{
case PORTAL_ONE_SELECT:
- result = DoPortalRunFetch(portal, fdirection, count, dest);
+ result = DoPortalRunFetch(portal, fdirection,
+ count, size, tupoverhead, dest);
break;
case PORTAL_ONE_RETURNING:
@@ -1439,7 +1475,8 @@ PortalRunFetch(Portal portal,
/*
* Now fetch desired portion of results.
*/
- result = DoPortalRunFetch(portal, fdirection, count, dest);
+ result = DoPortalRunFetch(portal, fdirection,
+ count, size, tupoverhead, dest);
break;
default:
@@ -1484,6 +1521,8 @@ static long
DoPortalRunFetch(Portal portal,
FetchDirection fdirection,
long count,
+ long size,
+ int tupoverhead,
DestReceiver *dest)
{
bool forward;
@@ -1526,7 +1565,7 @@ DoPortalRunFetch(Portal portal,
{
DoPortalRewind(portal);
if (count > 1)
- PortalRunSelect(portal, true, count - 1,
+ PortalRunSelect(portal, true, count - 1, 0L, 0,
None_Receiver);
}
else
@@ -1536,13 +1575,15 @@ DoPortalRunFetch(Portal portal,
if (portal->atEnd)
pos++; /* need one extra fetch if off end */
if (count <= pos)
- PortalRunSelect(portal, false, pos - count + 1,
+ PortalRunSelect(portal, false,
+ pos - count + 1, 0L, 0,
None_Receiver);
else if (count > pos + 1)
- PortalRunSelect(portal, true, count - pos - 1,
+ PortalRunSelect(portal, true,
+ count - pos - 1, 0L, 0,
None_Receiver);
}
- return PortalRunSelect(portal, true, 1L, dest);
+ return PortalRunSelect(portal, true, 1L, 0L, 0, dest);
}
else if (count < 0)
{
@@ -1553,17 +1594,19 @@ DoPortalRunFetch(Portal portal,
* (Is it worth considering case where count > half of size of
* query? We could rewind once we know the size ...)
*/
- PortalRunSelect(portal, true, FETCH_ALL, None_Receiver);
+ PortalRunSelect(portal, true,
+ FETCH_ALL, 0L, 0, None_Receiver);
if (count < -1)
- PortalRunSelect(portal, false, -count - 1, None_Receiver);
- return PortalRunSelect(portal, false, 1L, dest);
+ PortalRunSelect(portal, false,
+ -count - 1, 0, 0, None_Receiver);
+ return PortalRunSelect(portal, false, 1L, 0L, 0, dest);
}
else
{
/* count == 0 */
/* Rewind to start, return zero rows */
DoPortalRewind(portal);
- return PortalRunSelect(portal, true, 0L, dest);
+ return PortalRunSelect(portal, true, 0L, 0L, 0, dest);
}
break;
case FETCH_RELATIVE:
@@ -1573,8 +1616,9 @@ DoPortalRunFetch(Portal portal,
* Definition: advance count-1 rows, return next row (if any).
*/
if (count > 1)
- PortalRunSelect(portal, true, count - 1, None_Receiver);
- return PortalRunSelect(portal, true, 1L, dest);
+ PortalRunSelect(portal, true,
+ count - 1, 0L, 0, None_Receiver);
+ return PortalRunSelect(portal, true, 1L, 0L, 0, dest);
}
else if (count < 0)
{
@@ -1583,8 +1627,9 @@ DoPortalRunFetch(Portal portal,
* any).
*/
if (count < -1)
- PortalRunSelect(portal, false, -count - 1, None_Receiver);
- return PortalRunSelect(portal, false, 1L, dest);
+ PortalRunSelect(portal, false,
+ -count - 1, 0L, 0, None_Receiver);
+ return PortalRunSelect(portal, false, 1L, 0L, 0, dest);
}
else
{
@@ -1630,7 +1675,7 @@ DoPortalRunFetch(Portal portal,
*/
if (on_row)
{
- PortalRunSelect(portal, false, 1L, None_Receiver);
+ PortalRunSelect(portal, false, 1L, 0L, 0, None_Receiver);
/* Set up to fetch one row forward */
count = 1;
forward = true;
@@ -1652,7 +1697,7 @@ DoPortalRunFetch(Portal portal,
return result;
}
- return PortalRunSelect(portal, forward, count, dest);
+ return PortalRunSelect(portal, forward, count, size, tupoverhead, dest);
}
/*
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 55d483d..5f0c8f3 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -20,6 +20,7 @@
#include "access/transam.h"
#include "storage/bufpage.h"
+#include "executor/tuptable.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
* The key limit on this value is that the size of the fixed overhead for
@@ -761,6 +762,7 @@ extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern Size slot_compute_raw_data_size(TupleTableSlot *slot);
extern void heap_fill_tuple(TupleDesc tupleDesc,
Datum *values, bool *isnull,
char *data, Size data_size,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 193a654..e2706a6 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -79,8 +79,8 @@ extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;
/* Hook for plugins to get control in ExecutorRun() */
typedef void (*ExecutorRun_hook_type) (QueryDesc *queryDesc,
- ScanDirection direction,
- long count);
+ ScanDirection direction,
+ long count, long size, int tupoverhead);
extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;
/* Hook for plugins to get control in ExecutorFinish() */
@@ -175,9 +175,9 @@ extern TupleTableSlot *ExecFilterJunk(JunkFilter *junkfilter,
extern void ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void standard_ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void ExecutorRun(QueryDesc *queryDesc,
- ScanDirection direction, long count);
+ ScanDirection direction, long count, long size, int tupoverhead);
extern void standard_ExecutorRun(QueryDesc *queryDesc,
- ScanDirection direction, long count);
+ ScanDirection direction, long count, long size, int tupoverhead);
extern void ExecutorFinish(QueryDesc *queryDesc);
extern void standard_ExecutorFinish(QueryDesc *queryDesc);
extern void ExecutorEnd(QueryDesc *queryDesc);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cb8d854..f8121ec 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -410,6 +410,7 @@ typedef struct EState
PreExecCallbackItem *es_preExecCallbacks; /* pre-exec callbacks */
uint32 es_processed; /* # of tuples processed */
+ bool es_stoppedbysize; /* true if processing stopped by size */
Oid es_lastoid; /* last oid processed (by INSERT) */
int es_top_eflags; /* eflags passed to ExecutorStart */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 868905b..094c0ac 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2393,6 +2393,8 @@ typedef struct FetchStmt
NodeTag type;
FetchDirection direction; /* see above */
long howMany; /* number of rows, or position argument */
+ long howLarge; /* total bytes of rows */
+ int tupoverhead; /* declared overhead per tuple in client */
char *portalname; /* name of portal (cursor) */
bool ismove; /* TRUE if MOVE */
} FetchStmt;
diff --git a/src/include/tcop/pquery.h b/src/include/tcop/pquery.h
index 8073a6e..021532c 100644
--- a/src/include/tcop/pquery.h
+++ b/src/include/tcop/pquery.h
@@ -17,7 +17,6 @@
#include "nodes/parsenodes.h"
#include "utils/portal.h"
-
extern PGDLLIMPORT Portal ActivePortal;
@@ -33,13 +32,15 @@ extern void PortalStart(Portal portal, ParamListInfo params,
extern void PortalSetResultFormat(Portal portal, int nFormats,
int16 *formats);
-extern bool PortalRun(Portal portal, long count, bool isTopLevel,
- DestReceiver *dest, DestReceiver *altdest,
+extern bool PortalRun(Portal portal, long count, long size, int tupoverhead,
+ bool isTopLevel, DestReceiver *dest, DestReceiver *altdest,
char *completionTag);
extern long PortalRunFetch(Portal portal,
FetchDirection fdirection,
long count,
+ long size,
+ int tupoverhead,
DestReceiver *dest);
#endif /* PQUERY_H */
diff --git a/src/interfaces/ecpg/preproc/ecpg.addons b/src/interfaces/ecpg/preproc/ecpg.addons
index b3b36cf..424f412 100644
--- a/src/interfaces/ecpg/preproc/ecpg.addons
+++ b/src/interfaces/ecpg/preproc/ecpg.addons
@@ -220,13 +220,56 @@ ECPG: fetch_argsNEXTopt_from_incursor_name addon
ECPG: fetch_argsPRIORopt_from_incursor_name addon
ECPG: fetch_argsFIRST_Popt_from_incursor_name addon
ECPG: fetch_argsLAST_Popt_from_incursor_name addon
+ add_additional_variables($3, false);
+ if ($3[0] == ':')
+ {
+ free($3);
+ $3 = mm_strdup("$0");
+ }
ECPG: fetch_argsALLopt_from_incursor_name addon
+ECPG: fetch_argsFORWARDopt_from_incursor_name addon
+ECPG: fetch_argsBACKWARDopt_from_incursor_name addon
add_additional_variables($3, false);
if ($3[0] == ':')
{
free($3);
$3 = mm_strdup("$0");
}
+ECPG: fetch_argsALLLIMITIconstopt_overheadopt_from_incursor_name addon
+ add_additional_variables($6, false);
+ if ($6[0] == ':')
+ {
+ free($6);
+ $6 = mm_strdup("$0");
+ }
+ if ($3[0] == '$')
+ {
+ free($3);
+ $3 = mm_strdup("$0");
+ }
+ if ($4[0] == '$')
+ {
+ free($4);
+ $4 = mm_strdup("$0");
+ }
+ECPG: fetch_argsFORWARDALLLIMITIconstopt_overheadopt_from_incursor_name addon
+ECPG: fetch_argsBACKWARDALLLIMITIconstopt_overheadopt_from_incursor_name addon
+ add_additional_variables($7, false);
+ if ($7[0] == ':')
+ {
+ free($7);
+ $7 = mm_strdup("$0");
+ }
+ if ($4[0] == '$')
+ {
+ free($4);
+ $4 = mm_strdup("$0");
+ }
+ if ($5[0] == '$')
+ {
+ free($5);
+ $5 = mm_strdup("$0");
+ }
ECPG: fetch_argsSignedIconstopt_from_incursor_name addon
add_additional_variables($3, false);
if ($3[0] == ':')
@@ -234,11 +277,51 @@ ECPG: fetch_argsSignedIconstopt_from_incursor_name addon
free($3);
$3 = mm_strdup("$0");
}
+ECPG: fetch_argsSignedIconstLIMITIconstopt_overheadopt_from_incursor_name addon
+ add_additional_variables($6, false);
+ if ($6[0] == ':')
+ {
+ free($6);
+ $6 = mm_strdup("$0");
+ }
if ($1[0] == '$')
{
free($1);
$1 = mm_strdup("$0");
}
+ if ($3[0] == '$')
+ {
+ free($3);
+ $3 = mm_strdup("$0");
+ }
+ if ($4[0] == '$')
+ {
+ free($4);
+ $4 = mm_strdup("$0");
+ }
+ECPG: fetch_argsFORWARDSignedIconstLIMITIconstopt_overheadopt_from_incursor_name addon
+ECPG: fetch_argsBACKWARDSignedIconstLIMITIconstopt_overheadopt_from_incursor_name addon
+ add_additional_variables($7, false);
+ if ($7[0] == ':')
+ {
+ free($7);
+ $7 = mm_strdup("$0");
+ }
+ if ($2[0] == '$')
+ {
+ free($2);
+ $2 = mm_strdup("$0");
+ }
+ if ($4[0] == '$')
+ {
+ free($4);
+ $4 = mm_strdup("$0");
+ }
+ if ($5[0] == '$')
+ {
+ free($5);
+ $5 = mm_strdup("$0");
+ }
ECPG: fetch_argsFORWARDALLopt_from_incursor_name addon
ECPG: fetch_argsBACKWARDALLopt_from_incursor_name addon
add_additional_variables($4, false);
--
1.8.3.1
0003c-Add-foreign-table-option-to-set-fetch-size.patch (text/x-patch; charset=us-ascii)
>From 92596ee8cda043cbc37cd27d4ec668f624a49d6e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 30 Jun 2015 15:29:46 +0900
Subject: [PATCH] Add foreign table option to set fetch size.
---
contrib/postgres_fdw/option.c | 2 ++
contrib/postgres_fdw/postgres_fdw.c | 31 +++++++++++++++++++++++++++----
2 files changed, 29 insertions(+), 4 deletions(-)
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 7547ec2..793239d 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -153,6 +153,8 @@ InitPgFdwOptions(void)
/* updatable is available on both server and table */
{"updatable", ForeignServerRelationId, false},
{"updatable", ForeignTableRelationId, false},
+ /* fetch_size is available on table (XXX: also server may have it.) */
+ {"fetch_size", ForeignTableRelationId, false},
{NULL, InvalidOid, false}
};
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 40cac3b..e01213e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -152,6 +152,7 @@ typedef struct PgFdwScanState
FmgrInfo *param_flinfo; /* output conversion functions for them */
List *param_exprs; /* executable expressions for param values */
const char **param_values; /* textual values of query parameters */
+ int fetch_size; /* number of tuples to request by one fetch */
/* for storing result tuples */
HeapTuple *tuples; /* array of currently-retrieved tuples */
@@ -945,6 +946,31 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
server = GetForeignServer(table->serverid);
user = GetUserMapping(userid, server->serverid);
+ fsstate->fetch_size = -1;
+ foreach(lc, table->options)
+ {
+ DefElem *def = (DefElem *) lfirst(lc);
+
+ /* Does anyone specify negatives? Who cares? */
+ if (strcmp(def->defname, "fetch_size") == 0)
+ {
+ char *ep = NULL;
+ char *defstr = defGetString(def);
+
+ fsstate->fetch_size = strtol(defstr, &ep, 10);
+ if (*ep || ep == defstr)
+ {
+ elog(WARNING,
+ "Option \"%s\" must be a positive integer (foreign table \"%s\") : \"%s\"",
+ def->defname, get_rel_name(table->relid), defstr);
+ fsstate->fetch_size = -1; /* Use default */
+ }
+ break;
+ }
+ }
+ if (fsstate->fetch_size < 1)
+ fsstate->fetch_size = 100; /* default size */
+
/*
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
@@ -2092,15 +2118,12 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
{
PgFdwConn *conn = fsstate->conn;
char sql[64];
- int fetch_size;
+ int fetch_size = fsstate->fetch_size;
int numrows, addrows, restrows;
HeapTuple *tmptuples;
int i;
int fetch_buf_size;
- /* The fetch size is arbitrary, but shouldn't be enormous. */
- fetch_size = 100;
-
/* Make the query to fetch tuples */
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fetch_size, fsstate->cursor_number);
--
1.8.3.1
Ouch! I mistakenly made two CF entries for this patch. Could
someone remove this entry for me?
https://commitfest.postgresql.org/5/290/
The correct entry is "/5/291/"
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Ouch! I mistakenly made two CF entries for this patch. Could
someone remove this entry for me? https://commitfest.postgresql.org/5/290/
The correct entry is "/5/291/"
Done.
--
Michael
Thank you.
At Thu, 2 Jul 2015 16:02:27 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTs0YCwXedt1P=JjxFJeoj9UzLzkLuiX8=JdtPYUtNwwg@mail.gmail.com>
On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote: Ouch! I mistakenly made two CF entries for this patch. Could
someone remove this entry for me? https://commitfest.postgresql.org/5/290/
The correct entry is "/5/291/"
Done.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 07/02/2015 08:48 AM, Kyotaro HORIGUCHI wrote:
- It was a problem when to give the first kick for async exec. It
is not in ExecInit phase, and ExecProc phase does not fit,
too. An extra phase ExecPreProc or something is too
invasive. So I tried "pre-exec callback". Any init-node can register
registerd callbacks are called just before ExecProc phase in
executor. The first patch adds functions and structs to enable
this.
At a quick glance, I think this has all the same problems as starting
the execution at ExecInit phase. The correct way to do this is to kick
off the queries in the first IterateForeignScan() call. You said that
"ExecProc phase does not fit" - why not?
- Heikki
Hello, thank you for looking at this.
If it is acceptable to reconstruct the executor nodes to have an
additional return state PREP_RUN or the like (which means one more
call is needed before the first tuple), I'll modify the whole
executor to handle that state in the next patch.
I haven't taken the advice I received so far in this sense, but I
have come to think that it is the most reasonable way to solve this.
======
- It was a problem when to give the first kick for async exec. It
is not in ExecInit phase, and ExecProc phase does not fit,
too. An extra phase ExecPreProc or something is too
invasive. So I tried "pre-exec callback". Any init-node can register
registerd callbacks are called just before ExecProc phase in
executor. The first patch adds functions and structs to enable
this.
At a quick glance, I think this has all the same problems as starting
the execution at ExecInit phase. The correct way to do this is to kick
off the queries in the first IterateForeignScan() call. You said that
"ExecProc phase does not fit" - why not?
Execution nodes are expected to return the first tuple if one is
available, but asynchronous execution cannot return the first
tuple immediately. Starting execution for the first tuple
simultaneously on every foreign node matters more than
asynchronous fetching in many cases, especially for cases like
sort/agg pushdown on FDW.
The reason ExecProc does not fit is that a first loop iteration
that returns no tuple looks like it would impact too large a
portion of the executor.
It is my mistake that the patch doesn't address the problem of
parameterized paths. Parameterized paths should be executed
within ExecProc loops, so this patch would work like the following.
- To gain the advantage of kicking off execution before the first
ExecProc loop, non-parameterized paths are started using the
callback feature this patch provides.
- Parameterized paths need the upper nodes to be executed before
they start execution, so they should be started in the ExecProc
loop, but run asynchronously if possible.
This is rather a makeshift solution for the problem, but
considering the current trend toward parallelism, it might be time
to make the executor fit parallel execution.
If it is acceptable to reconstruct the executor nodes to have an
additional return state PREP_RUN or the like (which means one more
call is needed before the first tuple), I'll modify the whole
executor to handle that state in the next patch.
I will hate my stupidity if this kind of solution is what you were
suggesting by "do it in ExecProc" :(
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. This is the new version of this patch.
At Tue, 07 Jul 2015 10:19:35 +0900, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
This is rather a makeshift solution for the problem, but
considering the current trend toward parallelism, it might be time to
make the executor fit parallel execution.
If it is acceptable to reconstruct the executor nodes to have an
additional return state PREP_RUN or the like (which means one more
call is needed before the first tuple), I'll modify the whole
executor to handle that state in the next patch.
I made a patchset to do this. The details and some examples are
shown after the summary below.
- I provided an infrastructure for asynchronous (simultaneous)
execution of multiple execnodes belonging to one node, like joins.
- It (should) have addressed the "parameterized plan" problem.
- The infrastructure is a bit intrusive but simple, and it will
be usable by any node that supports asynchronous execution
(none so far except fdw, which needs some modification in core). So
the async exec for Postgres-FDW has now become an example of using
the infrastructure. It might be nice to start a backend worker for
a promising async request, for example for a sort node.
- The postgres_fdw part is almost the same as the previous one.
The detailed explanation of the patchset follows.
============
I made a patchset to do this. It consists of five patches (plus
one for debug messages).
1. Add infrastructure for the run state of executor nodes.
Currently executor nodes have a binary run state: !TupIsNull(slot)
indicates that the next tuple may come from the node, and
TupIsNull(slot) indicates that no more tuples will come.
This patch expands this to four states and stores the value in the
PlanState struct.
Inited : the node has just been initialized.
Started: the node has started execution but no tuple has been
retrieved yet. This state can be skipped.
Running: the node is returning tuples.
Done : the node has no more tuples to return. This is equivalent to
TupIsNull(slot).
The nodes Group, ModifyTable, SetOp and WindowAgg had their own
state flags in their *State structs that are replaceable by the new
states, so they are moved to this new state set in this patch. This
patch does not change the current behavior.
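For illustration only, the state values and the helper macros used
throughout the attached patches (SetNodeRunState, AdvanceNodeRunStateTo
and ExecNode_is_done) could be defined roughly as below. This is a
sketch that assumes the run state is kept in a new PlanState member
named "runstate"; the real definitions are in the execnodes.h part of
the patch and may differ.

/* Sketch only; names follow the patch, exact definitions are assumed. */
typedef enum NodeRunState
{
	Inited,				/* just initialized, not executed yet */
	Started,			/* execution kicked off, no tuple retrieved yet */
	Running,			/* returning tuples */
	Done				/* no more tuples; equivalent to TupIsNull(slot) */
} NodeRunState;

/* assumed new member of PlanState: NodeRunState runstate; */
#define SetNodeRunState(node, s) \
	(((PlanState *) (node))->runstate = (s))

#define ExecNode_is_done(node) \
	(((PlanState *) (node))->runstate == Done)

/* advance to the given state, but never move backward */
#define AdvanceNodeRunStateTo(node, s) \
	do { \
		if (((PlanState *) (node))->runstate < (s)) \
			((PlanState *) (node))->runstate = (s); \
	} while (0)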
2. Change all tuple-returning execnodes to maintain the new
run state appropriately.
The remaining nodes are modified by this patch to keep the state
consistent with the TupIsNull() state at the ExecProcNode
level. This patch does not change the current behavior either. (I
feel that the Done state may be nothing but an encumbrance for
maintenance; the state is not referred to anywhere.)
3. Add a feature to start nodes asynchronously.
With this patch, all nodes that have more than one child node can
execute their children asynchronously. A node tries to start its
children asynchronously if its state is "Inited" when entering the
Exec* functions. An async request for a node that has just one
child is simply propagated to the child, and leaf nodes such as
scans decide whether to run asynchronously or not. Currently no
leaf node can be async except postgres_fdw.
NestLoop may run a parameterized plan, so it is treated specially
in StartNestLoop so that parameterized plans are not started
asynchronously.
In StartHashJoin, whether the inner (hash) node is executed or
not is judged by logic similar to ExecHashJoin.
Even after this patch is applied, no leaf node can start
asynchronously, so the behavior of the executor is still
unchanged.
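As a rough sketch of the propagation described above (my own
illustration, not code from the patch; ExecStartNode is assumed here to
be the dispatcher that forwards an async-start request to a node), a
multi-child node like Append would do something like this:

/* Sketch only: propagate an async-start request to all children. */
static void
StartAppend(AppendState *node)
{
	int			i;

	for (i = 0; i < node->as_nplans; i++)
	{
		PlanState  *child = node->appendplans[i];

		/*
		 * Ask each child to start execution asynchronously.  Leaf nodes
		 * (e.g. ForeignScan) decide by themselves whether they actually
		 * can; single-child intermediate nodes just pass the request on.
		 */
		ExecStartNode(child);
	}

	AdvanceNodeRunStateTo(node, Started);
}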
4. Add StartForeignScan to FdwRoutine
Add a new entry point that accepts an asynchronous execution
request from the core.
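For example, assuming the new callback takes the ForeignScanState and
is modeled on BeginForeignScan (see the 0004 patch for the actual
signature), an FDW would opt in by filling the new field in its
handler, roughly like this:

/* Sketch only: an FDW wiring up the proposed StartForeignScan callback. */
static void
postgresStartForeignScan(ForeignScanState *node)
{
	/*
	 * Kick off the remote query asynchronously if the connection is idle;
	 * otherwise do nothing and fall back to the ordinary synchronous path
	 * in IterateForeignScan.
	 */
}

Datum
postgres_fdw_handler(PG_FUNCTION_ARGS)
{
	FdwRoutine *routine = makeNode(FdwRoutine);

	routine->BeginForeignScan = postgresBeginForeignScan;
	routine->IterateForeignScan = postgresIterateForeignScan;
	/* new optional entry point proposed by this patch set */
	routine->StartForeignScan = postgresStartForeignScan;
	/* ... other callbacks omitted ... */

	PG_RETURN_POINTER(routine);
}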
5. Allow asynchronous remote queries in postgres_fdw.
This is almost the same as the previous version, except that it
runs on the new infrastructure and adds a new server/foreign
table option, allow_async.
The first foreign scan on a given server will start execution
asynchronously if requested. Apart from the async start, every
successive fetch for the same foreign scan is also performed
asynchronously.
Currently there's no means to observe what it is doing from the
outside, so the additional sixth patch outputs debug messages
about asynchronous execution.
However, there is no test code for this yet, and I'm at a loss as
to what the test should look like.
FWIW I provided two examples of running asynchronous execution.
regards,
===== Example
CREATE SERVER sv1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'postgres');
CREATE SERVER sv2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER sv1;
CREATE USER MAPPING FOR CURRENT_USER SERVER sv2;
CREATE TABLE lp (a int, b int);
CREATE TABLE lt1 () INHERITS (lp);
CREATE TABLE lt2 () INHERITS (lp);
CREATE TABLE lt3 () INHERITS (lp);
CREATE TABLE lt4 () INHERITS (lp);
CREATE TABLE fp (LIKE lp);
CREATE FOREIGN TABLE ft1 () INHERITS (fp) SERVER sv1 OPTIONS (table_name 'lt1');
CREATE FOREIGN TABLE ft2 () INHERITS (fp) SERVER sv1 OPTIONS (table_name 'lt1');
CREATE FOREIGN TABLE ft3 () INHERITS (fp) SERVER sv2 OPTIONS (table_name 'lt1');
CREATE FOREIGN TABLE ft4 () INHERITS (fp) SERVER sv2 OPTIONS (table_name 'lt1');
INSERT INTO lt1 (SELECT a, a FROM generate_series(0, 999) a);
INSERT INTO lt2 (SELECT a+1000, a FROM generate_series(0, 999) a);
INSERT INTO lt3 (SELECT a+2000, a FROM generate_series(0, 999) a);
INSERT INTO lt4 (SELECT a+3000, a FROM generate_series(0, 999) a);
;; TEST FOR SIMPLE APPEND
=# SELECT * FROM fp;
1 LOG: pg_fdw: [ft1/sv1/0x293a580] Async exec started.
2 LOG: pg_fdw: [ft2/sv1/0x293a580] Async exec denied.
3 LOG: pg_fdw: [ft3/sv2/0x2898c70] Async exec started.
4 LOG: pg_fdw: [ft4/sv2/0x2898c70] Async exec denied.
5 LOG: pg_fdw: [ft1/sv1/0x293a580] Async fetch
....
6 LOG: pg_fdw: [ft1/sv1/0x293a580] Async fetch
7 LOG: pg_fdw: [ft2/sv1/0x293a580] Sync fetch.
8 LOG: pg_fdw: [ft2/sv1/0x293a580] Async fetch
...
9 LOG: pg_fdw: [ft2/sv1/0x293a580] Async fetch
10 LOG: pg_fdw: [ft3/sv2/0x2898c70] Async fetch
....
11 LOG: pg_fdw: [ft3/sv2/0x2898c70] Async fetch
12 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
14 LOG: pg_fdw: [ft4/sv2/0x2898c70] Async fetch
...
15 LOG: pg_fdw: [ft4/sv2/0x2898c70] Async fetch
;; The notation inside the square brackets is
;; <table name>/<server name>/<pointer of connection>.
;;
;; At 1-4, each foreign server denied async execution for its second scan (ft2/ft4).
;;
;; At 7, reading a different table than at 6 made it a sync fetch, but
;; the successive fetches afterward are async.
;;
;; ft2 and ft3 are on different servers, so 10 is an async fetch for
;; the query that was started asynchronously at 3.
;;
;; At 12 the same thing as at 7 occurred.
;; TEST FOR PARAMETERIZED NESTLOOP
=# SET enable_hashjoin TO false;
=# SET enable_mergejoin TO false;
=# SET enable_material TO false;
=# ALTER FOREIGN TABLE ft4 OPTIONS (ADD use_remote_estimate 'true');
=# SELECT ft4.a FROM ft1 JOIN ft4 ON ft1.b = ft4.b WHERE ft1.a BETWEEN 800 AND 1000;
1 LOG: pg_fdw: [ft1/sv1/0x293a580] Async exec started.
2 LOG: pg_fdw: [ft1/sv1/0x293a580] Async fetch
3 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
4 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
...
5 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
6 LOG: pg_fdw: [ft1/sv1/0x293a580] Async fetch
7 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
...
8 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
9 LOG: pg_fdw: [ft1/sv1/0x293a580] Async fetch
;; ft4 did not even try async since the inner side (ft4) is parameterized.
;; All fetches for the inner side (ft4) were executed synchronously.
;;
;; Meanwhile, ft1 was continuously reading asynchronously.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Add-infrastructure-for-executor-node-run-state.patch (text/x-patch; charset=us-ascii)
>From 34dabe294394c878186a3e33a248f85097b12c33 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 8 Jul 2015 11:48:12 +0900
Subject: [PATCH 1/6] Add infrastructure for executor node run state.
This infrastructure expands the node state from what ResultNode did to
a general form having four states.
The states are Inited, Started, Running and Done. Running and Done are
the same as what rs_done of ResultNode indicated. The Inited state
indicates that the node has been initialized but not executed. The
Started state indicates that the node has started execution but the
first tuple has not been received yet. Running indicates that the node
is returning tuples and Done indicates that the node has no more
tuples to return.
The nodes Group, ModifyTable, SetOp and WindowAgg had their own
run-state management, so they are moved to this infrastructure by this
patch.
---
src/backend/commands/explain.c | 2 +-
src/backend/executor/nodeAgg.c | 1 +
src/backend/executor/nodeAppend.c | 1 +
src/backend/executor/nodeBitmapAnd.c | 1 +
src/backend/executor/nodeBitmapHeapscan.c | 1 +
src/backend/executor/nodeBitmapIndexscan.c | 1 +
src/backend/executor/nodeBitmapOr.c | 1 +
src/backend/executor/nodeCtescan.c | 1 +
src/backend/executor/nodeCustom.c | 1 +
src/backend/executor/nodeForeignscan.c | 1 +
src/backend/executor/nodeFunctionscan.c | 1 +
src/backend/executor/nodeGroup.c | 14 +++++++++-----
src/backend/executor/nodeHash.c | 1 +
src/backend/executor/nodeHashjoin.c | 1 +
src/backend/executor/nodeIndexonlyscan.c | 1 +
src/backend/executor/nodeIndexscan.c | 1 +
src/backend/executor/nodeLimit.c | 1 +
src/backend/executor/nodeLockRows.c | 1 +
src/backend/executor/nodeMaterial.c | 1 +
src/backend/executor/nodeMergeAppend.c | 1 +
src/backend/executor/nodeMergejoin.c | 1 +
src/backend/executor/nodeModifyTable.c | 9 ++++++---
src/backend/executor/nodeNestloop.c | 1 +
src/backend/executor/nodeRecursiveunion.c | 1 +
src/backend/executor/nodeResult.c | 1 +
src/backend/executor/nodeSamplescan.c | 1 +
src/backend/executor/nodeSeqscan.c | 1 +
src/backend/executor/nodeSetOp.c | 20 ++++++++++++--------
src/backend/executor/nodeSort.c | 1 +
src/backend/executor/nodeSubqueryscan.c | 1 +
src/backend/executor/nodeTidscan.c | 1 +
src/backend/executor/nodeUnique.c | 1 +
src/backend/executor/nodeValuesscan.c | 1 +
src/backend/executor/nodeWindowAgg.c | 12 ++++++++----
src/backend/executor/nodeWorktablescan.c | 1 +
src/include/nodes/execnodes.h | 26 ++++++++++++++++++++++----
36 files changed, 88 insertions(+), 25 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d1ecc2..86e3b5a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2116,7 +2116,7 @@ static void
show_sort_info(SortState *sortstate, ExplainState *es)
{
Assert(IsA(sortstate, SortState));
- if (es->analyze && sortstate->sort_Done &&
+ if (es->analyze && ExecNode_is_done(sortstate) &&
sortstate->tuplesortstate != NULL)
{
Tuplesortstate *state = (Tuplesortstate *) sortstate->tuplesortstate;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2bf48c5..a48cd81 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1968,6 +1968,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate = makeNode(AggState);
aggstate->ss.ps.plan = (Plan *) node;
aggstate->ss.ps.state = estate;
+ SetNodeRunState(aggstate, Inited);
aggstate->aggs = NIL;
aggstate->numaggs = 0;
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 2cffef8..4718c0f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -140,6 +140,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
*/
appendstate->ps.plan = (Plan *) node;
appendstate->ps.state = estate;
+ SetNodeRunState(appendstate, Inited);
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index 205980e..8bc5bbe 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -63,6 +63,7 @@ ExecInitBitmapAnd(BitmapAnd *node, EState *estate, int eflags)
*/
bitmapandstate->ps.plan = (Plan *) node;
bitmapandstate->ps.state = estate;
+ SetNodeRunState(bitmapandstate, Inited);
bitmapandstate->bitmapplans = bitmapplanstates;
bitmapandstate->nplans = nplans;
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 4597437..8d10e28 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -555,6 +555,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate = makeNode(BitmapHeapScanState);
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
+ SetNodeRunState(scanstate, Inited);
scanstate->tbm = NULL;
scanstate->tbmiterator = NULL;
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 77fc1e5..613054f 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -206,6 +206,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
indexstate = makeNode(BitmapIndexScanState);
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
+ SetNodeRunState(indexstate, Inited);
/* normally we don't make the result bitmap till runtime */
indexstate->biss_result = NULL;
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 353a5b6..fcdaeaf 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -64,6 +64,7 @@ ExecInitBitmapOr(BitmapOr *node, EState *estate, int eflags)
*/
bitmaporstate->ps.plan = (Plan *) node;
bitmaporstate->ps.state = estate;
+ SetNodeRunState(bitmaporstate, Inited);
bitmaporstate->bitmapplans = bitmapplanstates;
bitmaporstate->nplans = nplans;
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 75c1ab3..666ef91 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -191,6 +191,7 @@ ExecInitCteScan(CteScan *node, EState *estate, int eflags)
scanstate = makeNode(CteScanState);
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
+ SetNodeRunState(scanstate, Inited);
scanstate->eflags = eflags;
scanstate->cte_table = NULL;
scanstate->eof_cte = false;
diff --git a/src/backend/executor/nodeCustom.c b/src/backend/executor/nodeCustom.c
index 0a022df..e7e3b17 100644
--- a/src/backend/executor/nodeCustom.c
+++ b/src/backend/executor/nodeCustom.c
@@ -43,6 +43,7 @@ ExecInitCustomScan(CustomScan *cscan, EState *estate, int eflags)
/* fill up fields of ScanState */
css->ss.ps.plan = &cscan->scan.plan;
css->ss.ps.state = estate;
+ SetNodeRunState(css, Inited);
/* create expression context for node */
ExecAssignExprContext(estate, &css->ss.ps);
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index bb28a73..3ba4196 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -116,6 +116,7 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
scanstate = makeNode(ForeignScanState);
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
+ SetNodeRunState(scanstate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index f5fa2b3..849b54f 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -299,6 +299,7 @@ ExecInitFunctionScan(FunctionScan *node, EState *estate, int eflags)
scanstate = makeNode(FunctionScanState);
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
+ SetNodeRunState(scanstate, Inited);
scanstate->eflags = eflags;
/*
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index 5e47854..1a8f669 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -40,10 +40,13 @@ ExecGroup(GroupState *node)
TupleTableSlot *firsttupleslot;
TupleTableSlot *outerslot;
+ /* Advance the state to running if just after initialized */
+ AdvanceNodeRunStateTo(node, Running);
+
/*
* get state info from node
*/
- if (node->grp_done)
+ if (ExecNode_is_done(node))
return NULL;
econtext = node->ss.ps.ps_ExprContext;
numCols = ((Group *) node->ss.ps.plan)->numCols;
@@ -86,7 +89,7 @@ ExecGroup(GroupState *node)
if (TupIsNull(outerslot))
{
/* empty input, so return nothing */
- node->grp_done = TRUE;
+ SetNodeRunState(node, Done);
return NULL;
}
/* Copy tuple into firsttupleslot */
@@ -138,7 +141,7 @@ ExecGroup(GroupState *node)
if (TupIsNull(outerslot))
{
/* no more groups, so we're done */
- node->grp_done = TRUE;
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -207,7 +210,7 @@ ExecInitGroup(Group *node, EState *estate, int eflags)
grpstate = makeNode(GroupState);
grpstate->ss.ps.plan = (Plan *) node;
grpstate->ss.ps.state = estate;
- grpstate->grp_done = FALSE;
+ SetNodeRunState(grpstate, Inited);
/*
* create expression context
@@ -282,7 +285,6 @@ ExecReScanGroup(GroupState *node)
{
PlanState *outerPlan = outerPlanState(node);
- node->grp_done = FALSE;
node->ss.ps.ps_TupFromTlist = false;
/* must clear first tuple */
ExecClearTuple(node->ss.ss_ScanTupleSlot);
@@ -293,4 +295,6 @@ ExecReScanGroup(GroupState *node)
*/
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
+
+ SetNodeRunState(node, Inited);
}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 906cb46..d388e17 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -182,6 +182,7 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
hashstate = makeNode(HashState);
hashstate->ps.plan = (Plan *) node;
hashstate->ps.state = estate;
+ SetNodeRunState(hashstate, Inited);
hashstate->hashtable = NULL;
hashstate->hashkeys = NIL; /* will be set by parent HashJoin */
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1d78cdf..064421e 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -451,6 +451,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
hjstate = makeNode(HashJoinState);
hjstate->js.ps.plan = (Plan *) node;
hjstate->js.ps.state = estate;
+ SetNodeRunState(hjstate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..0e84314 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -403,6 +403,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate = makeNode(IndexOnlyScanState);
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
+ SetNodeRunState(indexstate, Inited);
indexstate->ioss_HeapFetches = 0;
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index c0f14db..534d2f4 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -828,6 +828,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate = makeNode(IndexScanState);
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
+ SetNodeRunState(indexstate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index 40ac0d7..1b675d4 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -384,6 +384,7 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
limitstate = makeNode(LimitState);
limitstate->ps.plan = (Plan *) node;
limitstate->ps.state = estate;
+ SetNodeRunState(limitstate, Inited);
limitstate->lstate = LIMIT_INITIAL;
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index b9b0f06..eeeca0b 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -365,6 +365,7 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)
lrstate = makeNode(LockRowsState);
lrstate->ps.plan = (Plan *) node;
lrstate->ps.state = estate;
+ SetNodeRunState(lrstate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index b2b5aa7..d9a67f4 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -171,6 +171,7 @@ ExecInitMaterial(Material *node, EState *estate, int eflags)
matstate = makeNode(MaterialState);
matstate->ss.ps.plan = (Plan *) node;
matstate->ss.ps.state = estate;
+ SetNodeRunState(matstate, Inited);
/*
* We must have a tuplestore buffering the subplan output to do backward
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index bdf7680..3901255 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -83,6 +83,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
*/
mergestate->ps.plan = (Plan *) node;
mergestate->ps.state = estate;
+ SetNodeRunState(mergestate, Inited);
mergestate->mergeplans = mergeplanstates;
mergestate->ms_nplans = nplans;
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 34b6cf6..9970db1 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1485,6 +1485,7 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
mergestate = makeNode(MergeJoinState);
mergestate->js.ps.plan = (Plan *) node;
mergestate->js.ps.state = estate;
+ SetNodeRunState(mergestate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 874ca6a..25fa109 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1269,6 +1269,9 @@ ExecModifyTable(ModifyTableState *node)
HeapTupleData oldtupdata;
HeapTuple oldtuple;
+ /* Advance the state to running if just after initialized */
+ AdvanceNodeRunStateTo(node, Running);
+
/*
* This should NOT get called during EvalPlanQual; we should have passed a
* subplan tree to EvalPlanQual, instead. Use a runtime test not just
@@ -1287,7 +1290,7 @@ ExecModifyTable(ModifyTableState *node)
* our subplan's nodes aren't necessarily robust against being called
* extra times.
*/
- if (node->mt_done)
+ if (ExecNode_is_done(node))
return NULL;
/*
@@ -1464,7 +1467,7 @@ ExecModifyTable(ModifyTableState *node)
*/
fireASTriggers(node);
- node->mt_done = true;
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -1495,11 +1498,11 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
mtstate = makeNode(ModifyTableState);
mtstate->ps.plan = (Plan *) node;
mtstate->ps.state = estate;
+ SetNodeRunState(mtstate, Inited);
mtstate->ps.targetlist = NIL; /* not actually used */
mtstate->operation = operation;
mtstate->canSetTag = node->canSetTag;
- mtstate->mt_done = false;
mtstate->mt_plans = (PlanState **) palloc0(sizeof(PlanState *) * nplans);
mtstate->resultRelInfo = estate->es_result_relations + node->resultRelIndex;
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index e66bcda..b4c2f26 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -309,6 +309,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
nlstate = makeNode(NestLoopState);
nlstate->js.ps.plan = (Plan *) node;
nlstate->js.ps.state = estate;
+ SetNodeRunState(nlstate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 8df1639..118496e 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -176,6 +176,7 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
rustate = makeNode(RecursiveUnionState);
rustate->ps.plan = (Plan *) node;
rustate->ps.state = estate;
+ SetNodeRunState(rustate, Inited);
rustate->eqfunctions = NULL;
rustate->hashfunctions = NULL;
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 8d3dde0..b4ee402 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -217,6 +217,7 @@ ExecInitResult(Result *node, EState *estate, int eflags)
resstate = makeNode(ResultState);
resstate->ps.plan = (Plan *) node;
resstate->ps.state = estate;
+ SetNodeRunState(resstate, Inited);
resstate->rs_done = false;
resstate->rs_checkqual = (node->resconstantqual == NULL) ? false : true;
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index 4c1c523..e88735a 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -153,6 +153,7 @@ ExecInitSampleScan(SampleScan *node, EState *estate, int eflags)
scanstate = makeNode(SampleScanState);
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
+ SetNodeRunState(scanstate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 3cb81fc..259b79a 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -169,6 +169,7 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate = makeNode(SeqScanState);
scanstate->ps.plan = (Plan *) node;
scanstate->ps.state = estate;
+ SetNodeRunState(scanstate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 7d00cc5..123d051 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -197,6 +197,9 @@ ExecSetOp(SetOpState *node)
SetOp *plannode = (SetOp *) node->ps.plan;
TupleTableSlot *resultTupleSlot = node->ps.ps_ResultTupleSlot;
+ /* Advance the state to running if just after initialized */
+ AdvanceNodeRunStateTo(node, Running);
+
/*
* If the previously-returned tuple needs to be returned more than once,
* keep returning it.
@@ -208,7 +211,7 @@ ExecSetOp(SetOpState *node)
}
/* Otherwise, we're done if we are out of groups */
- if (node->setop_done)
+ if (ExecNode_is_done(node))
return NULL;
/* Fetch the next tuple group according to the correct strategy */
@@ -244,7 +247,7 @@ setop_retrieve_direct(SetOpState *setopstate)
/*
* We loop retrieving groups until we find one we should return
*/
- while (!setopstate->setop_done)
+ while (ExecNode_is_running(setopstate))
{
/*
* If we don't already have the first tuple of the new group, fetch it
@@ -261,7 +264,7 @@ setop_retrieve_direct(SetOpState *setopstate)
else
{
/* outer plan produced no tuples at all */
- setopstate->setop_done = true;
+ SetNodeRunState(setopstate, Done);
return NULL;
}
}
@@ -293,7 +296,7 @@ setop_retrieve_direct(SetOpState *setopstate)
if (TupIsNull(outerslot))
{
/* no more outer-plan tuples available */
- setopstate->setop_done = true;
+ SetNodeRunState(setopstate, Done);
break;
}
@@ -433,7 +436,7 @@ setop_retrieve_hash_table(SetOpState *setopstate)
/*
* We loop retrieving groups until we find one we should return
*/
- while (!setopstate->setop_done)
+ while (ExecNode_is_running(setopstate))
{
/*
* Find the next entry in the hash table
@@ -442,7 +445,7 @@ setop_retrieve_hash_table(SetOpState *setopstate)
if (entry == NULL)
{
/* No more entries in hashtable, so done */
- setopstate->setop_done = true;
+ SetNodeRunState(setopstate, Done);
return NULL;
}
@@ -490,7 +493,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
setopstate->eqfunctions = NULL;
setopstate->hashfunctions = NULL;
- setopstate->setop_done = false;
+ SetNodeRunState(setopstate, Inited);
setopstate->numOutput = 0;
setopstate->pergroup = NULL;
setopstate->grp_firstTuple = NULL;
@@ -601,7 +604,6 @@ void
ExecReScanSetOp(SetOpState *node)
{
ExecClearTuple(node->ps.ps_ResultTupleSlot);
- node->setop_done = false;
node->numOutput = 0;
if (((SetOp *) node->ps.plan)->strategy == SETOP_HASHED)
@@ -651,4 +653,6 @@ ExecReScanSetOp(SetOpState *node)
*/
if (node->ps.lefttree->chgParam == NULL)
ExecReScan(node->ps.lefttree);
+
+ SetNodeRunState(node, Inited);
}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index af1dccf..3ae5b89 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -162,6 +162,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
sortstate = makeNode(SortState);
sortstate->ss.ps.plan = (Plan *) node;
sortstate->ss.ps.state = estate;
+ SetNodeRunState(sortstate, Inited);
/*
* We must have random access to the sort output to do backward scan or
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index e5d1e54..497d6df 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -117,6 +117,7 @@ ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags)
subquerystate = makeNode(SubqueryScanState);
subquerystate->ss.ps.plan = (Plan *) node;
subquerystate->ss.ps.state = estate;
+ SetNodeRunState(subquerystate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 203f1ac..f19e735 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -461,6 +461,7 @@ ExecInitTidScan(TidScan *node, EState *estate, int eflags)
tidstate = makeNode(TidScanState);
tidstate->ss.ps.plan = (Plan *) node;
tidstate->ss.ps.state = estate;
+ SetNodeRunState(tidstate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 1cb4a8a..f259f32 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -122,6 +122,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
uniquestate = makeNode(UniqueState);
uniquestate->ps.plan = (Plan *) node;
uniquestate->ps.state = estate;
+ SetNodeRunState(uniquestate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeValuesscan.c b/src/backend/executor/nodeValuesscan.c
index a39695a..c56199c 100644
--- a/src/backend/executor/nodeValuesscan.c
+++ b/src/backend/executor/nodeValuesscan.c
@@ -205,6 +205,7 @@ ExecInitValuesScan(ValuesScan *node, EState *estate, int eflags)
scanstate = makeNode(ValuesScanState);
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
+ SetNodeRunState(scanstate, Inited);
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index ecf96f8..b91d4e6 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1564,7 +1564,10 @@ ExecWindowAgg(WindowAggState *winstate)
int i;
int numfuncs;
- if (winstate->all_done)
+ /* Advance the state to running if just after initialized */
+ AdvanceNodeRunStateTo(winstate, Running);
+
+ if (ExecNode_is_done(winstate))
return NULL;
/*
@@ -1686,7 +1689,7 @@ restart:
}
else
{
- winstate->all_done = true;
+ SetNodeRunState(winstate, Done);
return NULL;
}
}
@@ -1787,6 +1790,7 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
winstate = makeNode(WindowAggState);
winstate->ss.ps.plan = (Plan *) node;
winstate->ss.ps.state = estate;
+ SetNodeRunState(winstate, Inited);
/*
* Create expression contexts. We need two, one for per-input-tuple
@@ -2060,8 +2064,6 @@ ExecReScanWindowAgg(WindowAggState *node)
PlanState *outerPlan = outerPlanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
- node->all_done = false;
-
node->ss.ps.ps_TupFromTlist = false;
node->all_first = true;
@@ -2085,6 +2087,8 @@ ExecReScanWindowAgg(WindowAggState *node)
*/
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
+
+ SetNodeRunState(node, Inited);
}
/*
diff --git a/src/backend/executor/nodeWorktablescan.c b/src/backend/executor/nodeWorktablescan.c
index 618508e..799e96b 100644
--- a/src/backend/executor/nodeWorktablescan.c
+++ b/src/backend/executor/nodeWorktablescan.c
@@ -144,6 +144,7 @@ ExecInitWorkTableScan(WorkTableScan *node, EState *estate, int eflags)
scanstate = makeNode(WorkTableScanState);
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
+ SetNodeRunState(scanstate, Inited);
scanstate->rustate = NULL; /* we'll set this later */
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 541ee18..4066341 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -344,6 +344,27 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * Enumeration and macros for executor node running state.
+ */
+typedef enum ExecNodeRunState
+{
+ ERunState_Inited, /* Just after initialized */
+ ERunState_Started, /* Execution started but needs one more call
+ * for the first tuple */
+ ERunState_Running, /* Returning the next tuple */
+ ERunState_Done /* No tuple to return */
+} ExecNodeRunState;
+
+#define SetNodeRunState(nd,st) (((PlanState*)nd)->runstate = (ERunState_##st))
+#define AdvanceNodeRunStateTo(nd,st) \
+ do {\
+ if (((PlanState*)nd)->runstate < (ERunState_##st))\
+ ((PlanState*)nd)->runstate = (ERunState_##st);\
+	} while(0)
+#define ExecNode_is_running(nd) (((PlanState*)nd)->runstate == ERunState_Running)
+#define ExecNode_is_done(nd) (((PlanState*)nd)->runstate == ERunState_Done)
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -1059,6 +1080,7 @@ typedef struct PlanState
ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */
bool ps_TupFromTlist;/* state flag for processing set-valued
* functions in targetlist */
+ ExecNodeRunState runstate; /* Execution state of this node */
} PlanState;
/* ----------------
@@ -1120,7 +1142,6 @@ typedef struct ModifyTableState
PlanState ps; /* its first field is NodeTag */
CmdType operation; /* INSERT, UPDATE, or DELETE */
bool canSetTag; /* do we set the command tag/es_processed? */
- bool mt_done; /* are we done? */
PlanState **mt_plans; /* subplans (one per target rel) */
int mt_nplans; /* number of plans in the array */
int mt_whichplan; /* which one is being executed (0..n-1) */
@@ -1799,7 +1820,6 @@ typedef struct GroupState
{
ScanState ss; /* its first field is NodeTag */
FmgrInfo *eqfunctions; /* per-field lookup data for equality fns */
- bool grp_done; /* indicates completion of Group scan */
} GroupState;
/* ---------------------
@@ -1901,7 +1921,6 @@ typedef struct WindowAggState
ExprContext *tmpcontext; /* short-term evaluation context */
bool all_first; /* true if the scan is starting */
- bool all_done; /* true if the scan is finished */
bool partition_spooled; /* true if all tuples in current
* partition have been spooled into
* tuplestore */
@@ -1968,7 +1987,6 @@ typedef struct SetOpState
PlanState ps; /* its first field is NodeTag */
FmgrInfo *eqfunctions; /* per-grouping-field equality fns */
FmgrInfo *hashfunctions; /* per-grouping-field hash fns */
- bool setop_done; /* indicates completion of output scan */
long numOutput; /* number of dups left to output */
MemoryContext tempContext; /* short-term context for comparisons */
/* these fields are used in SETOP_SORTED mode: */
--
1.8.3.1
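
To illustrate the intended use of the new run-state machinery, here is a
minimal sketch of a tuple-returning Exec function driving the macros added
to execnodes.h above. FooState, ExecFoo and FooGetNextSlot are placeholders
for illustration only, not symbols from the patch; real nodes follow the
same shape through the PlanState cast inside the macros.

TupleTableSlot *
ExecFoo(FooState *node)
{
	TupleTableSlot *slot;

	/* Move from Inited to Running; never steps backward from Done */
	AdvanceNodeRunStateTo(node, Running);

	/* A node that already reached EOF keeps returning NULL */
	if (ExecNode_is_done(node))
		return NULL;

	/* node-specific fetch; placeholder for the per-node logic */
	slot = FooGetNextSlot(node);

	if (TupIsNull(slot))
		SetNodeRunState(node, Done);

	return slot;
}

A rescan simply puts the node back to the initial state with
SetNodeRunState(node, Inited), as the ExecReScan* changes above do.
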
0002-Change-all-tuple-returning-execution-nodes-to-mainta.patch (text/x-patch; charset=us-ascii)
>From d614c7f417b4c859173a472659c0985db228bfdc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 8 Jul 2015 17:39:47 +0900
Subject: [PATCH 2/6] Change all tuple-returning execution nodes to maintain
run-state appropriately.
This doesn't change any behavior, but maintains the run-state so that it
is consistent with whether the tuple returned at ExecProcNode is null.
---
src/backend/executor/nodeAgg.c | 6 ++++++
src/backend/executor/nodeAppend.c | 7 +++++++
src/backend/executor/nodeBitmapAnd.c | 6 ++++++
src/backend/executor/nodeBitmapHeapscan.c | 13 ++++++++++++-
src/backend/executor/nodeBitmapIndexscan.c | 6 ++++++
src/backend/executor/nodeBitmapOr.c | 6 ++++++
src/backend/executor/nodeCtescan.c | 13 ++++++++++++-
src/backend/executor/nodeCustom.c | 15 ++++++++++++++-
src/backend/executor/nodeForeignscan.c | 12 +++++++++++-
src/backend/executor/nodeFunctionscan.c | 13 ++++++++++++-
src/backend/executor/nodeGroup.c | 4 ++--
src/backend/executor/nodeHash.c | 6 ++++++
src/backend/executor/nodeHashjoin.c | 11 +++++++++++
src/backend/executor/nodeIndexonlyscan.c | 16 +++++++++++++++-
src/backend/executor/nodeIndexscan.c | 18 ++++++++++++++++--
src/backend/executor/nodeLimit.c | 22 ++++++++++++++++++++++
src/backend/executor/nodeLockRows.c | 7 +++++++
src/backend/executor/nodeMaterial.c | 9 +++++++++
src/backend/executor/nodeMergeAppend.c | 5 +++++
src/backend/executor/nodeMergejoin.c | 12 +++++++++++-
src/backend/executor/nodeNestloop.c | 5 +++++
src/backend/executor/nodeRecursiveunion.c | 5 +++++
src/backend/executor/nodeResult.c | 12 ++++++++++++
src/backend/executor/nodeSamplescan.c | 11 ++++++++++-
src/backend/executor/nodeSeqscan.c | 11 ++++++++++-
src/backend/executor/nodeSetOp.c | 3 +--
src/backend/executor/nodeSort.c | 11 +++++++++++
src/backend/executor/nodeSubqueryscan.c | 11 ++++++++++-
src/backend/executor/nodeTidscan.c | 12 +++++++++++-
src/backend/executor/nodeUnique.c | 5 +++++
src/backend/executor/nodeValuesscan.c | 12 +++++++++++-
src/backend/executor/nodeWindowAgg.c | 3 +--
src/backend/executor/nodeWorktablescan.c | 12 +++++++++++-
src/include/nodes/execnodes.h | 12 +++++++-----
34 files changed, 306 insertions(+), 26 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a48cd81..b07d57f 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1452,6 +1452,8 @@ ExecAgg(AggState *node)
{
TupleTableSlot *result;
+ SetNodeRunState(node, Running);
+
/*
* Check to see if we're still projecting out tuples from a previous agg
* tuple (because there is a function-returning-set in the projection
@@ -1492,6 +1494,7 @@ ExecAgg(AggState *node)
return result;
}
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -2651,6 +2654,9 @@ ExecReScanAgg(AggState *node)
int numGroupingSets = Max(node->maxsets, 1);
int setno;
+
+ SetNodeRunState(node, Inited);
+
node->agg_done = false;
node->ss.ps.ps_TupFromTlist = false;
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 4718c0f..03b3b66 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -194,6 +194,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ SetNodeRunState(node, Running);
+
for (;;)
{
PlanState *subnode;
@@ -229,7 +231,10 @@ ExecAppend(AppendState *node)
else
node->as_whichplan--;
if (!exec_append_initialize_next(node))
+ {
+ SetNodeRunState(node, Done);
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -268,6 +273,8 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ SetNodeRunState(node, Inited);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index 8bc5bbe..64f202e 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -105,6 +105,8 @@ MultiExecBitmapAnd(BitmapAndState *node)
if (node->ps.instrument)
InstrStartNode(node->ps.instrument);
+ SetNodeRunState(node, Running);
+
/*
* get information from the node
*/
@@ -146,6 +148,8 @@ MultiExecBitmapAnd(BitmapAndState *node)
if (result == NULL)
elog(ERROR, "BitmapAnd doesn't support zero inputs");
+ SetNodeRunState(node, Done);
+
/* must provide our own instrumentation support */
if (node->ps.instrument)
InstrStopNode(node->ps.instrument, 0 /* XXX */ );
@@ -189,6 +193,8 @@ ExecReScanBitmapAnd(BitmapAndState *node)
{
int i;
+ SetNodeRunState(node, Inited);
+
for (i = 0; i < node->nplans; i++)
{
PlanState *subnode = node->bitmapplans[i];
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 8d10e28..47f08bf 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -437,9 +437,18 @@ BitmapHeapRecheck(BitmapHeapScanState *node, TupleTableSlot *slot)
TupleTableSlot *
ExecBitmapHeapScan(BitmapHeapScanState *node)
{
- return ExecScan(&node->ss,
+ TupleTableSlot *slot;
+
+ SetNodeRunState(node, Running);
+
+ slot = ExecScan(&node->ss,
(ExecScanAccessMtd) BitmapHeapNext,
(ExecScanRecheckMtd) BitmapHeapRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
/* ----------------------------------------------------------------
@@ -451,6 +460,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
{
PlanState *outerPlan = outerPlanState(node);
+ SetNodeRunState(node, Inited);
+
/* rescan to release any page pin */
heap_rescan(node->ss.ss_currentScanDesc, NULL);
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 613054f..acfce3d 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -44,6 +44,8 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
if (node->ss.ps.instrument)
InstrStartNode(node->ss.ps.instrument);
+ SetNodeRunState(node, Running);
+
/*
* extract necessary information from index scan node
*/
@@ -98,6 +100,8 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
NULL, 0);
}
+ SetNodeRunState(node, Done);
+
/* must provide our own instrumentation support */
if (node->ss.ps.instrument)
InstrStopNode(node->ss.ps.instrument, nTuples);
@@ -117,6 +121,8 @@ ExecReScanBitmapIndexScan(BitmapIndexScanState *node)
{
ExprContext *econtext = node->biss_RuntimeContext;
+ SetNodeRunState(node, Inited);
+
/*
* Reset the runtime-key context so we don't leak memory as each outer
* tuple is scanned. Note this assumes that we will recalculate *all*
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index fcdaeaf..7a5bcf5 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -106,6 +106,8 @@ MultiExecBitmapOr(BitmapOrState *node)
if (node->ps.instrument)
InstrStartNode(node->ps.instrument);
+ SetNodeRunState(node, Running);
+
/*
* get information from the node
*/
@@ -162,6 +164,8 @@ MultiExecBitmapOr(BitmapOrState *node)
if (result == NULL)
elog(ERROR, "BitmapOr doesn't support zero inputs");
+ SetNodeRunState(node, Done);
+
/* must provide our own instrumentation support */
if (node->ps.instrument)
InstrStopNode(node->ps.instrument, 0 /* XXX */ );
@@ -205,6 +209,8 @@ ExecReScanBitmapOr(BitmapOrState *node)
{
int i;
+ SetNodeRunState(node, Inited);
+
for (i = 0; i < node->nplans; i++)
{
PlanState *subnode = node->bitmapplans[i];
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 666ef91..d237370 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -152,9 +152,18 @@ CteScanRecheck(CteScanState *node, TupleTableSlot *slot)
TupleTableSlot *
ExecCteScan(CteScanState *node)
{
- return ExecScan(&node->ss,
+ TupleTableSlot *slot;
+
+ SetNodeRunState(node, Running);
+
+ slot = ExecScan(&node->ss,
(ExecScanAccessMtd) CteScanNext,
(ExecScanRecheckMtd) CteScanRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
@@ -312,6 +321,8 @@ ExecReScanCteScan(CteScanState *node)
{
Tuplestorestate *tuplestorestate = node->leader->cte_table;
+ SetNodeRunState(node, Inited);
+
ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
ExecScanReScan(&node->ss);
diff --git a/src/backend/executor/nodeCustom.c b/src/backend/executor/nodeCustom.c
index e7e3b17..9f85fd7 100644
--- a/src/backend/executor/nodeCustom.c
+++ b/src/backend/executor/nodeCustom.c
@@ -110,8 +110,16 @@ ExecInitCustomScan(CustomScan *cscan, EState *estate, int eflags)
TupleTableSlot *
ExecCustomScan(CustomScanState *node)
{
+ TupleTableSlot *slot;
+
Assert(node->methods->ExecCustomScan != NULL);
- return node->methods->ExecCustomScan(node);
+ SetNodeRunState(node, Running);
+ slot = node->methods->ExecCustomScan(node);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
void
@@ -136,6 +144,7 @@ void
ExecReScanCustomScan(CustomScanState *node)
{
Assert(node->methods->ReScanCustomScan != NULL);
+ SetNodeRunState(node, Inited);
node->methods->ReScanCustomScan(node);
}
@@ -158,5 +167,9 @@ ExecCustomRestrPos(CustomScanState *node)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("custom-scan \"%s\" does not support MarkPos",
node->methods->CustomName)));
+
node->methods->RestrPosCustomScan(node);
+
+ /* Restoring position in turn restores run state */
+ SetNodeRunState(node, Running);
}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 3ba4196..5d39c85 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -88,9 +88,17 @@ ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
TupleTableSlot *
ExecForeignScan(ForeignScanState *node)
{
- return ExecScan((ScanState *) node,
+ TupleTableSlot * slot;
+
+ SetNodeRunState(node, Running);
+ slot = ExecScan((ScanState *) node,
(ExecScanAccessMtd) ForeignNext,
(ExecScanRecheckMtd) ForeignRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
@@ -232,6 +240,8 @@ ExecEndForeignScan(ForeignScanState *node)
void
ExecReScanForeignScan(ForeignScanState *node)
{
+ SetNodeRunState(node, Inited);
+
node->fdwroutine->ReScanForeignScan(node);
ExecScanReScan(&node->ss);
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index 849b54f..08f9bbf 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -265,9 +265,18 @@ FunctionRecheck(FunctionScanState *node, TupleTableSlot *slot)
TupleTableSlot *
ExecFunctionScan(FunctionScanState *node)
{
- return ExecScan(&node->ss,
+ TupleTableSlot *slot;
+
+ SetNodeRunState(node, Running);
+
+ slot = ExecScan(&node->ss,
(ExecScanAccessMtd) FunctionNext,
(ExecScanRecheckMtd) FunctionRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
/* ----------------------------------------------------------------
@@ -569,6 +578,8 @@ ExecReScanFunctionScan(FunctionScanState *node)
int i;
Bitmapset *chgparam = node->ss.ps.chgParam;
+ SetNodeRunState(node, Inited);
+
ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
for (i = 0; i < node->nfuncs; i++)
{
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index 1a8f669..a593d9f 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -285,6 +285,8 @@ ExecReScanGroup(GroupState *node)
{
PlanState *outerPlan = outerPlanState(node);
+ SetNodeRunState(node, Inited);
+
node->ss.ps.ps_TupFromTlist = false;
/* must clear first tuple */
ExecClearTuple(node->ss.ss_ScanTupleSlot);
@@ -295,6 +297,4 @@ ExecReScanGroup(GroupState *node)
*/
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
-
- SetNodeRunState(node, Inited);
}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index d388e17..308a5aab 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -84,6 +84,8 @@ MultiExecHash(HashState *node)
if (node->ps.instrument)
InstrStartNode(node->ps.instrument);
+ SetNodeRunState(node, Running);
+
/*
* get state info from node
*/
@@ -148,6 +150,8 @@ MultiExecHash(HashState *node)
if (hashtable->spaceUsed > hashtable->spacePeak)
hashtable->spacePeak = hashtable->spaceUsed;
+ SetNodeRunState(node, Done);
+
/* must provide our own instrumentation support */
if (node->ps.instrument)
InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -1260,6 +1264,8 @@ ExecHashTableResetMatchFlags(HashJoinTable hashtable)
void
ExecReScanHash(HashState *node)
{
+ SetNodeRunState(node, Inited);
+
/*
* if chgParam of subnode is not null then plan will be re-scanned by
* first ExecProcNode.
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 064421e..dbaabc4 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -72,6 +72,8 @@ ExecHashJoin(HashJoinState *node)
uint32 hashvalue;
int batchno;
+ SetNodeRunState(node, Running);
+
/*
* get information from HashJoin node
*/
@@ -155,6 +157,7 @@ ExecHashJoin(HashJoinState *node)
if (TupIsNull(node->hj_FirstOuterTupleSlot))
{
node->hj_OuterNotEmpty = false;
+ SetNodeRunState(node, Done);
return NULL;
}
else
@@ -183,7 +186,10 @@ ExecHashJoin(HashJoinState *node)
* outer relation.
*/
if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
+ {
+ SetNodeRunState(node, Done);
return NULL;
+ }
/*
* need to remember whether nbatch has increased since we
@@ -414,7 +420,10 @@ ExecHashJoin(HashJoinState *node)
* Try to advance to next batch. Done if there are no more.
*/
if (!ExecHashJoinNewBatch(node))
+ {
+ SetNodeRunState(node, Done);
return NULL; /* end of join */
+ }
node->hj_JoinState = HJ_NEED_NEW_OUTER;
break;
@@ -944,6 +953,8 @@ ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
void
ExecReScanHashJoin(HashJoinState *node)
{
+ SetNodeRunState(node, Inited);
+
/*
* In a multi-batch join, we currently have to do rescans the hard way,
* primarily because batch temp files may have already been released. But
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0e84314..b3676c9 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -252,15 +252,24 @@ IndexOnlyRecheck(IndexOnlyScanState *node, TupleTableSlot *slot)
TupleTableSlot *
ExecIndexOnlyScan(IndexOnlyScanState *node)
{
+ TupleTableSlot *slot;
+
/*
* If we have runtime keys and they've not already been set up, do it now.
*/
if (node->ioss_NumRuntimeKeys != 0 && !node->ioss_RuntimeKeysReady)
ExecReScan((PlanState *) node);
- return ExecScan(&node->ss,
+ SetNodeRunState(node, Running);
+
+ slot = ExecScan(&node->ss,
(ExecScanAccessMtd) IndexOnlyNext,
(ExecScanRecheckMtd) IndexOnlyRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
/* ----------------------------------------------------------------
@@ -277,6 +286,8 @@ ExecIndexOnlyScan(IndexOnlyScanState *node)
void
ExecReScanIndexOnlyScan(IndexOnlyScanState *node)
{
+ SetNodeRunState(node, Inited);
+
/*
* If we are doing runtime key calculations (ie, any of the index key
* values weren't simple Consts), compute the new key values. But first,
@@ -376,6 +387,9 @@ void
ExecIndexOnlyRestrPos(IndexOnlyScanState *node)
{
index_restrpos(node->ioss_ScanDesc);
+
+ /* Restoring position in turn restores run state */
+ SetNodeRunState(node, Running);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 534d2f4..a343c5c 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -484,20 +484,29 @@ reorderqueue_pop(IndexScanState *node)
TupleTableSlot *
ExecIndexScan(IndexScanState *node)
{
+ TupleTableSlot *slot;
+
/*
* If we have runtime keys and they've not already been set up, do it now.
*/
if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
ExecReScan((PlanState *) node);
+ SetNodeRunState(node, Running);
+
if (node->iss_NumOrderByKeys > 0)
- return ExecScan(&node->ss,
+ slot = ExecScan(&node->ss,
(ExecScanAccessMtd) IndexNextWithReorder,
(ExecScanRecheckMtd) IndexRecheck);
else
- return ExecScan(&node->ss,
+ slot = ExecScan(&node->ss,
(ExecScanAccessMtd) IndexNext,
(ExecScanRecheckMtd) IndexRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
/* ----------------------------------------------------------------
@@ -514,6 +523,8 @@ ExecIndexScan(IndexScanState *node)
void
ExecReScanIndexScan(IndexScanState *node)
{
+ SetNodeRunState(node, Inited);
+
/*
* If we are doing runtime key calculations (ie, any of the index key
* values weren't simple Consts), compute the new key values. But first,
@@ -802,6 +813,9 @@ void
ExecIndexRestrPos(IndexScanState *node)
{
index_restrpos(node->iss_ScanDesc);
+
+ /* Restoring position in turn restores run state */
+ SetNodeRunState(node, Running);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index 1b675d4..e59d71f 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -43,6 +43,8 @@ ExecLimit(LimitState *node)
TupleTableSlot *slot;
PlanState *outerPlan;
+ SetNodeRunState(node, Running);
+
/*
* get information from the node
*/
@@ -72,7 +74,10 @@ ExecLimit(LimitState *node)
* If backwards scan, just return NULL without changing state.
*/
if (!ScanDirectionIsForward(direction))
+ {
+ SetNodeRunState(node, Done);
return NULL;
+ }
/*
* Check for empty window; if so, treat like empty subplan.
@@ -80,6 +85,7 @@ ExecLimit(LimitState *node)
if (node->count <= 0 && !node->noCount)
{
node->lstate = LIMIT_EMPTY;
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -96,6 +102,7 @@ ExecLimit(LimitState *node)
* any output at all.
*/
node->lstate = LIMIT_EMPTY;
+ SetNodeRunState(node, Done);
return NULL;
}
node->subSlot = slot;
@@ -115,6 +122,7 @@ ExecLimit(LimitState *node)
* The subplan is known to return no tuples (or not more than
* OFFSET tuples, in general). So we return no tuples.
*/
+ SetNodeRunState(node, Done);
return NULL;
case LIMIT_INWINDOW:
@@ -130,6 +138,7 @@ ExecLimit(LimitState *node)
node->position - node->offset >= node->count)
{
node->lstate = LIMIT_WINDOWEND;
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -140,6 +149,7 @@ ExecLimit(LimitState *node)
if (TupIsNull(slot))
{
node->lstate = LIMIT_SUBPLANEOF;
+ SetNodeRunState(node, Done);
return NULL;
}
node->subSlot = slot;
@@ -154,6 +164,7 @@ ExecLimit(LimitState *node)
if (node->position <= node->offset + 1)
{
node->lstate = LIMIT_WINDOWSTART;
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -170,7 +181,10 @@ ExecLimit(LimitState *node)
case LIMIT_SUBPLANEOF:
if (ScanDirectionIsForward(direction))
+ {
+ SetNodeRunState(node, Done);
return NULL;
+ }
/*
* Backing up from subplan EOF, so re-fetch previous tuple; there
@@ -186,7 +200,10 @@ ExecLimit(LimitState *node)
case LIMIT_WINDOWEND:
if (ScanDirectionIsForward(direction))
+ {
+ SetNodeRunState(node, Done);
return NULL;
+ }
/*
* Backing up from window end: simply re-return the last tuple
@@ -199,7 +216,10 @@ ExecLimit(LimitState *node)
case LIMIT_WINDOWSTART:
if (!ScanDirectionIsForward(direction))
+ {
+ SetNodeRunState(node, Done);
return NULL;
+ }
/*
* Advancing after having backed off window start: simply
@@ -443,6 +463,8 @@ ExecEndLimit(LimitState *node)
void
ExecReScanLimit(LimitState *node)
{
+ SetNodeRunState(node, Inited);
+
/*
* Recompute limit/offset in case parameters changed, and reset the state
* machine. We must do this before rescanning our child node, in case
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index eeeca0b..2ccf05d 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -44,6 +44,8 @@ ExecLockRows(LockRowsState *node)
bool epq_needed;
ListCell *lc;
+ SetNodeRunState(node, Running);
+
/*
* get information from the node
*/
@@ -57,7 +59,10 @@ lnext:
slot = ExecProcNode(outerPlan);
if (TupIsNull(slot))
+ {
+ SetNodeRunState(node, Done);
return NULL;
+ }
/* We don't need EvalPlanQual unless we get updated tuple version(s) */
epq_needed = false;
@@ -460,6 +465,8 @@ ExecEndLockRows(LockRowsState *node)
void
ExecReScanLockRows(LockRowsState *node)
{
+ SetNodeRunState(node, Inited);
+
/*
* if chgParam of subnode is not null then plan will be re-scanned by
* first ExecProcNode.
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index d9a67f4..981398a 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -45,6 +45,8 @@ ExecMaterial(MaterialState *node)
bool eof_tuplestore;
TupleTableSlot *slot;
+ SetNodeRunState(node, Running);
+
/*
* get state info from node
*/
@@ -132,6 +134,7 @@ ExecMaterial(MaterialState *node)
if (TupIsNull(outerslot))
{
node->eof_underlying = true;
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -152,6 +155,7 @@ ExecMaterial(MaterialState *node)
/*
* Nothing left ...
*/
+ SetNodeRunState(node, Done);
return ExecClearTuple(slot);
}
@@ -307,6 +311,9 @@ ExecMaterialRestrPos(MaterialState *node)
* copy the mark to the active read pointer.
*/
tuplestore_copy_read_pointer(node->tuplestorestate, 1, 0);
+
+ /* Restoring position in turn restores run state */
+ SetNodeRunState(node, Running);
}
/* ----------------------------------------------------------------
@@ -320,6 +327,8 @@ ExecReScanMaterial(MaterialState *node)
{
PlanState *outerPlan = outerPlanState(node);
+ SetNodeRunState(node, Inited);
+
ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
if (node->eflags != 0)
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 3901255..4678d7c 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -170,6 +170,8 @@ ExecMergeAppend(MergeAppendState *node)
TupleTableSlot *result;
SlotNumber i;
+ SetNodeRunState(node, Running);
+
if (!node->ms_initialized)
{
/*
@@ -207,6 +209,7 @@ ExecMergeAppend(MergeAppendState *node)
{
/* All the subplans are exhausted, and so is the heap */
result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ SetNodeRunState(node, Done);
}
else
{
@@ -289,6 +292,8 @@ ExecReScanMergeAppend(MergeAppendState *node)
{
int i;
+ SetNodeRunState(node, Inited);
+
for (i = 0; i < node->ms_nplans; i++)
{
PlanState *subnode = node->mergeplans[i];
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 9970db1..74ceaa2 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -630,6 +630,8 @@ ExecMergeJoin(MergeJoinState *node)
bool doFillOuter;
bool doFillInner;
+ SetNodeRunState(node, Running);
+
/*
* get information from node
*/
@@ -728,6 +730,7 @@ ExecMergeJoin(MergeJoinState *node)
break;
}
/* Otherwise we're done. */
+ SetNodeRunState(node, Done);
return NULL;
}
break;
@@ -785,6 +788,7 @@ ExecMergeJoin(MergeJoinState *node)
break;
}
/* Otherwise we're done. */
+ SetNodeRunState(node, Done);
return NULL;
}
break;
@@ -1039,6 +1043,7 @@ ExecMergeJoin(MergeJoinState *node)
break;
}
/* Otherwise we're done. */
+ SetNodeRunState(node, Done);
return NULL;
}
break;
@@ -1174,6 +1179,7 @@ ExecMergeJoin(MergeJoinState *node)
break;
}
/* Otherwise we're done. */
+ SetNodeRunState(node, Done);
return NULL;
}
}
@@ -1292,6 +1298,7 @@ ExecMergeJoin(MergeJoinState *node)
break;
}
/* Otherwise we're done. */
+ SetNodeRunState(node, Done);
return NULL;
}
break;
@@ -1362,6 +1369,7 @@ ExecMergeJoin(MergeJoinState *node)
break;
}
/* Otherwise we're done. */
+ SetNodeRunState(node, Done);
return NULL;
}
break;
@@ -1406,6 +1414,7 @@ ExecMergeJoin(MergeJoinState *node)
if (TupIsNull(innerTupleSlot))
{
MJ_printf("ExecMergeJoin: end of inner subplan\n");
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -1448,6 +1457,7 @@ ExecMergeJoin(MergeJoinState *node)
if (TupIsNull(outerTupleSlot))
{
MJ_printf("ExecMergeJoin: end of outer subplan\n");
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -1682,6 +1692,7 @@ ExecEndMergeJoin(MergeJoinState *node)
void
ExecReScanMergeJoin(MergeJoinState *node)
{
+ SetNodeRunState(node, Inited);
ExecClearTuple(node->mj_MarkedTupleSlot);
node->mj_JoinState = EXEC_MJ_INITIALIZE_OUTER;
@@ -1699,5 +1710,4 @@ ExecReScanMergeJoin(MergeJoinState *node)
ExecReScan(node->js.ps.lefttree);
if (node->js.ps.righttree->chgParam == NULL)
ExecReScan(node->js.ps.righttree);
-
}
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index b4c2f26..ae69176 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -69,6 +69,8 @@ ExecNestLoop(NestLoopState *node)
ExprContext *econtext;
ListCell *lc;
+ SetNodeRunState(node, Running);
+
/*
* get information from the node
*/
@@ -128,6 +130,7 @@ ExecNestLoop(NestLoopState *node)
if (TupIsNull(outerTupleSlot))
{
ENL1_printf("no outer tuple, ending join");
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -429,6 +432,8 @@ ExecReScanNestLoop(NestLoopState *node)
{
PlanState *outerPlan = outerPlanState(node);
+ SetNodeRunState(node, Inited);
+
/*
* If outerPlan->chgParam is not null then plan will be automatically
* re-scanned by first ExecProcNode.
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 118496e..27a86d3 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -81,6 +81,8 @@ ExecRecursiveUnion(RecursiveUnionState *node)
TupleTableSlot *slot;
bool isnew;
+ SetNodeRunState(node, Running);
+
/* 1. Evaluate non-recursive term */
if (!node->recursing)
{
@@ -154,6 +156,7 @@ ExecRecursiveUnion(RecursiveUnionState *node)
return slot;
}
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -309,6 +312,8 @@ ExecReScanRecursiveUnion(RecursiveUnionState *node)
PlanState *innerPlan = innerPlanState(node);
RecursiveUnion *plan = (RecursiveUnion *) node->ps.plan;
+ SetNodeRunState(node, Inited);
+
/*
* Set recursive term's chgParam to tell it that we'll modify the working
* table and therefore it has to rescan.
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index b4ee402..ec81eda 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -72,6 +72,7 @@ ExecResult(ResultState *node)
ExprContext *econtext;
ExprDoneCond isDone;
+ SetNodeRunState(node, Running);
econtext = node->ps.ps_ExprContext;
/*
@@ -87,6 +88,7 @@ ExecResult(ResultState *node)
if (!qualResult)
{
node->rs_done = true;
+ SetNodeRunState(node, Done);
return NULL;
}
}
@@ -130,7 +132,10 @@ ExecResult(ResultState *node)
outerTupleSlot = ExecProcNode(outerPlan);
if (TupIsNull(outerTupleSlot))
+ {
+ SetNodeRunState(node, Done);
return NULL;
+ }
/*
* prepare to compute projection expressions, which will expect to
@@ -161,6 +166,7 @@ ExecResult(ResultState *node)
}
}
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -189,7 +195,12 @@ ExecResultRestrPos(ResultState *node)
PlanState *outerPlan = outerPlanState(node);
if (outerPlan != NULL)
+ {
ExecRestrPos(outerPlan);
+
+ /* Restoring position in turn restores run state */
+ SetNodeRunState(node, Running);
+ }
else
elog(ERROR, "Result nodes do not support mark/restore");
}
@@ -295,6 +306,7 @@ ExecEndResult(ResultState *node)
void
ExecReScanResult(ResultState *node)
{
+ SetNodeRunState(node, Inited);
node->rs_done = false;
node->ps.ps_TupFromTlist = false;
node->rs_checkqual = (node->resconstantqual == NULL) ? false : true;
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index e88735a..cb1d2ec 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -91,9 +91,17 @@ SampleRecheck(SampleScanState *node, TupleTableSlot *slot)
TupleTableSlot *
ExecSampleScan(SampleScanState *node)
{
- return ExecScan((ScanState *) node,
+ TupleTableSlot *slot;
+
+ SetNodeRunState(node, Running);
+ slot = ExecScan((ScanState *) node,
(ExecScanAccessMtd) SampleNext,
(ExecScanRecheckMtd) SampleRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
/* ----------------------------------------------------------------
@@ -247,6 +255,7 @@ ExecEndSampleScan(SampleScanState *node)
void
ExecReScanSampleScan(SampleScanState *node)
{
+ SetNodeRunState(node, Inited);
heap_rescan(node->ss.ss_currentScanDesc, NULL);
/*
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 259b79a..b2ed888 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -108,9 +108,17 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
TupleTableSlot *
ExecSeqScan(SeqScanState *node)
{
- return ExecScan((ScanState *) node,
+ TupleTableSlot *slot;
+
+ SetNodeRunState(node, Running);
+ slot = ExecScan((ScanState *) node,
(ExecScanAccessMtd) SeqNext,
(ExecScanRecheckMtd) SeqRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
/* ----------------------------------------------------------------
@@ -266,6 +274,7 @@ ExecReScanSeqScan(SeqScanState *node)
{
HeapScanDesc scan;
+ SetNodeRunState(node, Inited);
scan = node->ss_currentScanDesc;
heap_rescan(scan, /* scan desc */
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 123d051..c248ff3 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -603,6 +603,7 @@ ExecEndSetOp(SetOpState *node)
void
ExecReScanSetOp(SetOpState *node)
{
+ SetNodeRunState(node, Inited);
ExecClearTuple(node->ps.ps_ResultTupleSlot);
node->numOutput = 0;
@@ -653,6 +654,4 @@ ExecReScanSetOp(SetOpState *node)
*/
if (node->ps.lefttree->chgParam == NULL)
ExecReScan(node->ps.lefttree);
-
- SetNodeRunState(node, Inited);
}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 3ae5b89..a2abec7 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -49,6 +49,8 @@ ExecSort(SortState *node)
SO1_printf("ExecSort: %s\n",
"entering routine");
+ SetNodeRunState(node, Running);
+
estate = node->ss.ps.state;
dir = estate->es_direction;
tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
@@ -138,6 +140,10 @@ ExecSort(SortState *node)
(void) tuplesort_gettupleslot(tuplesortstate,
ScanDirectionIsForward(dir),
slot);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
return slot;
}
@@ -282,6 +288,9 @@ ExecSortRestrPos(SortState *node)
if (!node->sort_Done)
return;
+ /* Restoring position in turn restores run state */
+ SetNodeRunState(node, Running);
+
/*
* restore the scan to the previously marked position
*/
@@ -293,6 +302,8 @@ ExecReScanSort(SortState *node)
{
PlanState *outerPlan = outerPlanState(node);
+ SetNodeRunState(node, Inited);
+
/*
* If we haven't sorted yet, just return. If outerplan's chgParam is not
* NULL then it will be re-scanned by ExecProcNode, else no reason to
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index 497d6df..d8799d1 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -90,9 +90,17 @@ SubqueryRecheck(SubqueryScanState *node, TupleTableSlot *slot)
TupleTableSlot *
ExecSubqueryScan(SubqueryScanState *node)
{
- return ExecScan(&node->ss,
+ TupleTableSlot *slot;
+
+ SetNodeRunState(node, Running);
+ slot = ExecScan(&node->ss,
(ExecScanAccessMtd) SubqueryNext,
(ExecScanRecheckMtd) SubqueryRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
/* ----------------------------------------------------------------
@@ -199,6 +207,7 @@ ExecEndSubqueryScan(SubqueryScanState *node)
void
ExecReScanSubqueryScan(SubqueryScanState *node)
{
+ SetNodeRunState(node, Inited);
ExecScanReScan(&node->ss);
/*
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index f19e735..724016d 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -390,9 +390,17 @@ TidRecheck(TidScanState *node, TupleTableSlot *slot)
TupleTableSlot *
ExecTidScan(TidScanState *node)
{
- return ExecScan(&node->ss,
+ TupleTableSlot *slot;
+
+ SetNodeRunState(node, Running);
+ slot = ExecScan(&node->ss,
(ExecScanAccessMtd) TidNext,
(ExecScanRecheckMtd) TidRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
/* ----------------------------------------------------------------
@@ -402,6 +410,8 @@ ExecTidScan(TidScanState *node)
void
ExecReScanTidScan(TidScanState *node)
{
+ SetNodeRunState(node, Inited);
+
if (node->tss_TidList)
pfree(node->tss_TidList);
node->tss_TidList = NULL;
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index f259f32..1f7ca10 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -50,6 +50,8 @@ ExecUnique(UniqueState *node)
TupleTableSlot *slot;
PlanState *outerPlan;
+ SetNodeRunState(node, Running);
+
/*
* get information from the node
*/
@@ -71,6 +73,7 @@ ExecUnique(UniqueState *node)
{
/* end of subplan, so we're done */
ExecClearTuple(resultTupleSlot);
+ SetNodeRunState(node, Done);
return NULL;
}
@@ -187,6 +190,8 @@ ExecEndUnique(UniqueState *node)
void
ExecReScanUnique(UniqueState *node)
{
+ SetNodeRunState(node, Inited);
+
/* must clear result tuple so first input tuple is returned */
ExecClearTuple(node->ps.ps_ResultTupleSlot);
diff --git a/src/backend/executor/nodeValuesscan.c b/src/backend/executor/nodeValuesscan.c
index c56199c..48d1ad8 100644
--- a/src/backend/executor/nodeValuesscan.c
+++ b/src/backend/executor/nodeValuesscan.c
@@ -175,9 +175,18 @@ ValuesRecheck(ValuesScanState *node, TupleTableSlot *slot)
TupleTableSlot *
ExecValuesScan(ValuesScanState *node)
{
- return ExecScan(&node->ss,
+ TupleTableSlot *slot;
+
+	/* This node is now running */
+ SetNodeRunState(node, Running);
+ slot = ExecScan(&node->ss,
(ExecScanAccessMtd) ValuesNext,
(ExecScanRecheckMtd) ValuesRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
/* ----------------------------------------------------------------
@@ -302,6 +311,7 @@ ExecEndValuesScan(ValuesScanState *node)
void
ExecReScanValuesScan(ValuesScanState *node)
{
+ SetNodeRunState(node, Inited);
ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
ExecScanReScan(&node->ss);
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index b91d4e6..29d4389 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2064,6 +2064,7 @@ ExecReScanWindowAgg(WindowAggState *node)
PlanState *outerPlan = outerPlanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
+ SetNodeRunState(node, Inited);
node->ss.ps.ps_TupFromTlist = false;
node->all_first = true;
@@ -2087,8 +2088,6 @@ ExecReScanWindowAgg(WindowAggState *node)
*/
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
-
- SetNodeRunState(node, Inited);
}
/*
diff --git a/src/backend/executor/nodeWorktablescan.c b/src/backend/executor/nodeWorktablescan.c
index 799e96b..46350ab 100644
--- a/src/backend/executor/nodeWorktablescan.c
+++ b/src/backend/executor/nodeWorktablescan.c
@@ -80,6 +80,10 @@ WorkTableScanRecheck(WorkTableScanState *node, TupleTableSlot *slot)
TupleTableSlot *
ExecWorkTableScan(WorkTableScanState *node)
{
+ TupleTableSlot *slot;
+
+ SetNodeRunState(node, Running);
+
/*
* On the first call, find the ancestor RecursiveUnion's state via the
* Param slot reserved for it. (We can't do this during node init because
@@ -114,9 +118,14 @@ ExecWorkTableScan(WorkTableScanState *node)
ExecAssignScanProjectionInfo(&node->ss);
}
- return ExecScan(&node->ss,
+ slot = ExecScan(&node->ss,
(ExecScanAccessMtd) WorkTableScanNext,
(ExecScanRecheckMtd) WorkTableScanRecheck);
+
+ if (TupIsNull(slot))
+ SetNodeRunState(node, Done);
+
+ return slot;
}
@@ -210,6 +219,7 @@ ExecEndWorkTableScan(WorkTableScanState *node)
void
ExecReScanWorkTableScan(WorkTableScanState *node)
{
+ SetNodeRunState(node, Inited);
ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
ExecScanReScan(&node->ss);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4066341..c27b443 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -355,14 +355,16 @@ typedef enum ExecNodeRunState
ERunState_Done /* No tuple to return */
} ExecNodeRunState;
-#define SetNodeRunState(nd,st) (((PlanState*)nd)->runstate = (ERunState_##st))
+#define SetNodeRunState(nd,st) (((PlanState*)(nd))->runstate = (ERunState_##st))
+#define ExecNode_is(nd,st) (((PlanState*)(nd))->runstate == (ERunState_##st))
+#define ExecNode_is_inited(nd) (ExecNode_is((nd),Inited))
+#define ExecNode_is_running(nd) (ExecNode_is((nd),Running))
+#define ExecNode_is_done(nd) (ExecNode_is((nd),Done))
#define AdvanceNodeRunStateTo(nd,st) \
do {\
- if (((PlanState*)nd)->runstate < (ERunState_##st))\
- ((PlanState*)nd)->runstate = (ERunState_##st);\
+ if (((PlanState*)(nd))->runstate < (ERunState_##st))\
+ ((PlanState*)(nd))->runstate = (ERunState_##st);\
} while(0);
-#define ExecNode_is_running(nd) (((PlanState*)nd)->runstate == ERunState_Running)
-#define ExecNode_is_done(nd) (((PlanState*)nd)->runstate == ERunState_Done)
/* ----------------
* EState information
--
1.8.3.1
0003-Add-a-feature-to-start-node-asynchronously.patch (text/x-patch; charset=us-ascii)
>From 6f943869f0eb34289ac96a84384e98baff4a26d1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 9 Jul 2015 19:34:15 +0900
Subject: [PATCH 3/6] Add a feature to start node asynchronously
Add a feature that lets join nodes and Append/MergeAppend nodes start
their child nodes asynchronously. At this point no node is actually able
to run asynchronously yet, so the behavior is unchanged.
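
For illustration only, here is a minimal, self-contained model of the
start-propagation pattern this patch introduces.  The types and names
below are simplified stand-ins, not the actual executor API: a Start
function refuses to act unless the node is still in the Inited state,
recursively tries to start its children, and marks the node Started
only when something underneath actually went asynchronous.

/* Simplified model of StartProcNode propagation; not the real executor types */
#include <stdbool.h>
#include <stdio.h>

typedef enum { RS_INITED, RS_STARTED, RS_RUNNING, RS_DONE } RunState;

typedef struct Node
{
	const char *name;
	RunState	runstate;
	bool		can_async;		/* e.g. a scan able to kick off remote work */
	struct Node *children[2];	/* unused slots are NULL */
} Node;

/* counterpart of StartProcNode: true if the subtree started asynchronously */
static bool
StartNode(Node *node)
{
	bool		async = false;
	int			i;

	if (node == NULL || node->runstate != RS_INITED)
		return false;			/* refuse duplicate or too-late starts */

	for (i = 0; i < 2; i++)
		async |= StartNode(node->children[i]);

	if (node->can_async)
		async = true;			/* leaf that starts its own async work */

	if (async)
		node->runstate = RS_STARTED;

	return async;
}

int
main(void)
{
	Node		scan = {"scan", RS_INITED, true, {NULL, NULL}};
	Node		sort = {"sort", RS_INITED, false, {&scan, NULL}};
	Node		join = {"join", RS_INITED, false, {&sort, NULL}};

	printf("subtree started asynchronously: %s\n",
		   StartNode(&join) ? "yes" : "no");
	return 0;
}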
---
src/backend/executor/execProcnode.c | 103 ++++++++++++++++++++++++++++++++
src/backend/executor/nodeAgg.c | 22 +++++++
src/backend/executor/nodeAppend.c | 29 +++++++++
src/backend/executor/nodeCtescan.c | 21 +++++++
src/backend/executor/nodeGroup.c | 22 +++++++
src/backend/executor/nodeHash.c | 22 +++++++
src/backend/executor/nodeHashjoin.c | 54 +++++++++++++++++
src/backend/executor/nodeLimit.c | 22 +++++++
src/backend/executor/nodeLockRows.c | 22 +++++++
src/backend/executor/nodeMaterial.c | 23 +++++++
src/backend/executor/nodeMergeAppend.c | 30 ++++++++++
src/backend/executor/nodeMergejoin.c | 29 +++++++++
src/backend/executor/nodeNestloop.c | 34 +++++++++++
src/backend/executor/nodeResult.c | 24 ++++++++
src/backend/executor/nodeSetOp.c | 22 +++++++
src/backend/executor/nodeSort.c | 22 +++++++
src/backend/executor/nodeSubqueryscan.c | 22 +++++++
src/backend/executor/nodeUnique.c | 22 +++++++
src/backend/executor/nodeWindowAgg.c | 22 +++++++
src/include/executor/executor.h | 1 +
src/include/executor/nodeAgg.h | 1 +
src/include/executor/nodeAppend.h | 1 +
src/include/executor/nodeCtescan.h | 1 +
src/include/executor/nodeGroup.h | 1 +
src/include/executor/nodeHash.h | 1 +
src/include/executor/nodeHashjoin.h | 1 +
src/include/executor/nodeLimit.h | 1 +
src/include/executor/nodeLockRows.h | 1 +
src/include/executor/nodeMaterial.h | 1 +
src/include/executor/nodeMergeAppend.h | 1 +
src/include/executor/nodeMergejoin.h | 1 +
src/include/executor/nodeNestloop.h | 1 +
src/include/executor/nodeResult.h | 1 +
src/include/executor/nodeSetOp.h | 1 +
src/include/executor/nodeSort.h | 1 +
src/include/executor/nodeSubqueryscan.h | 1 +
src/include/executor/nodeUnique.h | 1 +
src/include/executor/nodeWindowAgg.h | 1 +
38 files changed, 586 insertions(+)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 03c2feb..2b282ca 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -769,3 +769,106 @@ ExecEndNode(PlanState *node)
break;
}
}
+
+/*
+ * StartProcNode - start the nodes underneath asynchronously if possible
+ *
+ * Returns true if the node has been started asynchronously. Some of the
+ * underlying nodes may have been started even when false is returned.
+ */
+bool
+StartProcNode(PlanState *node)
+{
+ /*
+	 * Refuse a duplicate start. This occurs for children skipped on rescan
+	 * of nodes such as MergeAppend.
+ */
+ if (node->runstate > ERunState_Started)
+ return false;
+
+ switch (nodeTag(node))
+ {
+ case T_ResultState:
+ return StartResult((ResultState *)node);
+
+ case T_AppendState:
+ return StartAppend((AppendState *)node);
+
+ case T_MergeAppendState:
+ return StartMergeAppend((MergeAppendState *)node);
+
+ case T_SubqueryScanState:
+ return StartSubqueryScan((SubqueryScanState *)node);
+
+ case T_CteScanState:
+ return StartCteScan((CteScanState *)node);
+
+ /*
+ * join nodes
+ */
+ case T_NestLoopState:
+ return StartNestLoop((NestLoopState *)node);
+
+ case T_MergeJoinState:
+ return StartMergeJoin((MergeJoinState *)node);
+
+ case T_HashJoinState:
+ return StartHashJoin((HashJoinState *)node);
+
+ /*
+ * materialization nodes
+ */
+ case T_MaterialState:
+ return StartMaterial((MaterialState *)node);
+
+ case T_SortState:
+ return StartSort((SortState *)node);
+
+ case T_GroupState:
+ return StartGroup((GroupState *)node);
+
+ case T_AggState:
+ return StartAgg((AggState *)node);
+
+ case T_WindowAggState:
+ return StartWindowAgg((WindowAggState *)node);
+
+ case T_UniqueState:
+ return StartUnique((UniqueState *)node);
+
+ case T_HashState:
+ return StartHash((HashState *)node);
+
+ case T_SetOpState:
+ return StartSetOp((SetOpState *)node);
+
+ case T_LockRowsState:
+ return StartLockRows((LockRowsState *)node);
+
+ case T_LimitState:
+ return StartLimit((LimitState *)node);
+
+ /* These nodes cannot run asynchronously */
+ case T_ForeignScanState:
+ case T_WorkTableScanState:
+ case T_CustomScanState:
+ case T_FunctionScanState:
+ case T_ValuesScanState:
+ case T_SeqScanState:
+ case T_SampleScanState:
+ case T_IndexScanState:
+ case T_IndexOnlyScanState:
+ case T_BitmapIndexScanState:
+ case T_BitmapHeapScanState:
+ case T_TidScanState:
+ case T_ModifyTableState:
+ case T_RecursiveUnionState:
+ case T_BitmapAndState:
+ case T_BitmapOrState:
+ return false;
+
+ default:
+ elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+ break;
+ }
+}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b07d57f..dfd5816 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1499,6 +1499,28 @@ ExecAgg(AggState *node)
}
/*
+ * StartAgg - Try asynchronous execution of this node
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ */
+bool
+StartAgg(AggState *node)
+{
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(outerPlanState(node)))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
+
+/*
* ExecAgg for non-hashed case
*/
static TupleTableSlot *
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 03b3b66..2b918b2 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -194,6 +194,10 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ /* start child nodes asynchronously if possible */
+ if (ExecNode_is_inited(node))
+ StartAppend(node);
+
SetNodeRunState(node, Running);
for (;;)
@@ -241,6 +245,31 @@ ExecAppend(AppendState *node)
}
/* ----------------------------------------------------------------
+ * StartAppend
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartAppend(AppendState *node)
+{
+ int i;
+ bool async = false;
+
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ for (i = 0 ; i < node->as_nplans ; i++)
+ async |= StartProcNode(node->appendplans[i]);
+
+ if (async)
+ SetNodeRunState(node, Started);
+
+ return async;
+}
+
+/* ----------------------------------------------------------------
* ExecEndAppend
*
* Shuts down the subscans of the append node.
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index d237370..cae0aca 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -166,6 +166,27 @@ ExecCteScan(CteScanState *node)
return slot;
}
+/* ----------------------------------------------------------------
+ * StartCteScan
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartCteScan(CteScanState *node)
+{
+	if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(node->cteplanstate))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
/* ----------------------------------------------------------------
* ExecInitCteScan
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index a593d9f..ea947b9 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -190,6 +190,28 @@ ExecGroup(GroupState *node)
}
/* -----------------
+ * StartGroup
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * -----------------
+ */
+bool
+StartGroup(GroupState *node)
+{
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(outerPlanState(node)))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
+/* -----------------
* ExecInitGroup
*
* Creates the run-time information for the group node produced by the
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 308a5aab..030bb67 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -167,6 +167,28 @@ MultiExecHash(HashState *node)
}
/* ----------------------------------------------------------------
+ * StartHash
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartHash(HashState *node)
+{
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(outerPlanState(node)))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
* ExecInitHash
*
* Init routine for Hash node
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index dbaabc4..ada9290 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -72,6 +72,10 @@ ExecHashJoin(HashJoinState *node)
uint32 hashvalue;
int batchno;
+ /* Try to start asynchronously */
+ if (ExecNode_is_inited(node))
+ StartHashJoin(node);
+
SetNodeRunState(node, Running);
/*
@@ -435,6 +439,56 @@ ExecHashJoin(HashJoinState *node)
}
/* ----------------------------------------------------------------
+ * StartHashJoin
+ *
+ * This function behaves a bit differently from the Start functions of other
+ * nodes, reflecting the behavior of ExecHashJoin.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartHashJoin(HashJoinState *node)
+{
+ PlanState *outerNode = outerPlanState(node);
+ HashState *hashNode = (HashState *) innerPlanState(node);
+ bool async;
+
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ async = StartProcNode(outerNode);
+
+ /*
+	 * This condition is the same as the one used to decide whether the inner
+	 * hash is needed at HJ_BUILD_HASHTABLE in ExecHashJoin.
+ */
+ if (!HJ_FILL_INNER(node) &&
+ (HJ_FILL_OUTER(node) ||
+ (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost &&
+ !node->hj_OuterNotEmpty)))
+ {
+ /*
+		 * The first tuple of the outer plan is needed to judge whether the
+		 * inner hash is necessary, so don't start the inner plan here.
+		 * Although this condition depends on the costs of outer startup and
+		 * hash creation, and asynchronous execution may upset that balance,
+		 * we keep relying on the same formula for now, for lack of an
+		 * appropriate alternative.
+ */
+ }
+ else
+ {
+ /* Hash will be created. Start the inner node. */
+ async |= StartProcNode((PlanState *)hashNode);
+ }
+
+ if (async)
+ SetNodeRunState(node, Started);
+
+ return async;
+}
+
+/* ----------------------------------------------------------------
* ExecInitHashJoin
*
* Init routine for HashJoin node.
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index e59d71f..21b2c37 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -243,6 +243,28 @@ ExecLimit(LimitState *node)
return slot;
}
+/* ----------------------------------------------------------------
+ * StartLimit
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartLimit(LimitState *node)
+{
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(outerPlanState(node)))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
/*
* Evaluate the limit/offset expressions --- done at startup or rescan.
*
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 2ccf05d..e741a93 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -347,6 +347,28 @@ lnext:
}
/* ----------------------------------------------------------------
+ * StartLockRows
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartLockRows(LockRowsState *node)
+{
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(outerPlanState(node)))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
* ExecInitLockRows
*
* This initializes the LockRows node state structures and
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 981398a..4e41c1c 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -160,6 +160,28 @@ ExecMaterial(MaterialState *node)
}
/* ----------------------------------------------------------------
+ * StartMaterial
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartMaterial(MaterialState *node)
+{
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(outerPlanState(node)))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
* ExecInitMaterial
* ----------------------------------------------------------------
*/
@@ -330,6 +352,7 @@ ExecReScanMaterial(MaterialState *node)
SetNodeRunState(node, Inited);
ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ SetNodeRunState(node, Inited);
if (node->eflags != 0)
{
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 4678d7c..ab6c304 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -170,6 +170,10 @@ ExecMergeAppend(MergeAppendState *node)
TupleTableSlot *result;
SlotNumber i;
+ /* start child nodes asynchronously if possible */
+ if (ExecNode_is_inited(node))
+ StartMergeAppend(node);
+
SetNodeRunState(node, Running);
if (!node->ms_initialized)
@@ -220,6 +224,32 @@ ExecMergeAppend(MergeAppendState *node)
return result;
}
+/* ----------------------------------------------------------------
+ * StartMergeAppend
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartMergeAppend(MergeAppendState *node)
+{
+ int i;
+ bool async = false;
+
+	if (!ExecNode_is_inited(node))
+ return false;
+
+ for (i = 0 ; i < node->ms_nplans ; i++)
+ async |= StartProcNode(node->mergeplans[i]);
+
+ if (async)
+ SetNodeRunState(node, Started);
+
+ return async;
+}
+
+
/*
* Compare the tuples in the two given slots.
*/
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 74ceaa2..32bd8a5 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -630,6 +630,10 @@ ExecMergeJoin(MergeJoinState *node)
bool doFillOuter;
bool doFillInner;
+	/* Start child nodes asynchronously if possible */
+ if (ExecNode_is_inited(node))
+ StartMergeJoin(node);
+
SetNodeRunState(node, Running);
/*
@@ -1475,6 +1479,31 @@ ExecMergeJoin(MergeJoinState *node)
}
/* ----------------------------------------------------------------
+ * StartMergeJoin
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartMergeJoin(MergeJoinState *node)
+{
+ bool async;
+
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ /* Merge join can unconditionally start child nodes asynchronously */
+ async = StartProcNode(innerPlanState(node));
+ async |= StartProcNode(outerPlanState(node));
+
+ if (async)
+ SetNodeRunState(node, Started);
+
+ return async;
+}
+
+/* ----------------------------------------------------------------
* ExecInitMergeJoin
* ----------------------------------------------------------------
*/
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index ae69176..cb95a3d 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -69,6 +69,10 @@ ExecNestLoop(NestLoopState *node)
ExprContext *econtext;
ListCell *lc;
+	/* Start child nodes asynchronously if possible */
+ if (ExecNode_is_inited(node))
+ StartNestLoop(node);
+
SetNodeRunState(node, Running);
/*
@@ -292,6 +296,36 @@ ExecNestLoop(NestLoopState *node)
}
/* ----------------------------------------------------------------
+ * StartNestLoop
+ *
+ * The inner plan of nest loop won't be executed asynchronously if it is
+ * parameterized.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartNestLoop(NestLoopState *node)
+{
+ NestLoop *nl = (NestLoop *) node->js.ps.plan;
+ bool async;
+
+ if (!ExecNode_is_inited(node))
+		return false;
+
+ /* Always try async execution of outer plan */
+ async = StartProcNode(outerPlanState(node));
+
+ /* This inner node cannot be asynchronous if it is parameterized */
+ if (list_length(nl->nestParams) < 1)
+ async |= StartProcNode(innerPlanState(node));
+
+ if (async)
+ SetNodeRunState(node, Started);
+
+ return async;
+}
+
+/* ----------------------------------------------------------------
* ExecInitNestLoop
* ----------------------------------------------------------------
*/
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index ec81eda..c33443a 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -171,6 +171,30 @@ ExecResult(ResultState *node)
}
/* ----------------------------------------------------------------
+ * StartResult
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartResult(ResultState * node)
+{
+ PlanState *subnode = outerPlanState(node);
+
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ if (subnode && StartProcNode(subnode))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
* ExecResultMarkPos
* ----------------------------------------------------------------
*/
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index c248ff3..a0dbec6 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -225,6 +225,28 @@ ExecSetOp(SetOpState *node)
return setop_retrieve_direct(node);
}
+/* ----------------------------------------------------------------
+ * StartSetOp
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartSetOp(SetOpState *node)
+{
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(outerPlanState(node)))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
/*
* ExecSetOp for non-hashed case
*/
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index a2abec7..f0c9a63 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -148,6 +148,28 @@ ExecSort(SortState *node)
}
/* ----------------------------------------------------------------
+ * StartSort
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartSort(SortState *node)
+{
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(outerPlanState(node)))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
* ExecInitSort
*
* Creates the run-time state information for the sort node
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index d8799d1..4899d93 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -104,6 +104,28 @@ ExecSubqueryScan(SubqueryScanState *node)
}
/* ----------------------------------------------------------------
+ * StartSubqueryScan
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartSubqueryScan(SubqueryScanState *node)
+{
+	if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(node->subplan))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
* ExecInitSubqueryScan
* ----------------------------------------------------------------
*/
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 1f7ca10..53da967 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -105,6 +105,28 @@ ExecUnique(UniqueState *node)
}
/* ----------------------------------------------------------------
+ * StartUnique
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartUnique(UniqueState *node)
+{
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(outerPlanState(node)))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
* ExecInitUnique
*
* This initializes the unique node state structures and
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 29d4389..3fd5e24 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1760,6 +1760,28 @@ restart:
}
/* -----------------
+ * StartWindowAgg
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * -----------------
+ */
+bool
+StartWindowAgg(WindowAggState *node)
+{
+ if (!ExecNode_is_inited(node))
+ return false;
+
+ if (StartProcNode(outerPlanState(node)))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
+
+/* -----------------
* ExecInitWindowAgg
*
* Creates the run-time information for the WindowAgg node produced by the
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 193a654..cf7d386 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -223,6 +223,7 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
*/
extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
extern TupleTableSlot *ExecProcNode(PlanState *node);
+extern bool StartProcNode(PlanState *node);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index fe3b81a..7fb0a6f 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -18,6 +18,7 @@
extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern TupleTableSlot *ExecAgg(AggState *node);
+extern bool StartAgg(AggState *node);
extern void ExecEndAgg(AggState *node);
extern void ExecReScanAgg(AggState *node);
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index f2d920b..d77b70e 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -18,6 +18,7 @@
extern AppendState *ExecInitAppend(Append *node, EState *estate, int eflags);
extern TupleTableSlot *ExecAppend(AppendState *node);
+extern bool StartAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
diff --git a/src/include/executor/nodeCtescan.h b/src/include/executor/nodeCtescan.h
index 369dafa..e418786 100644
--- a/src/include/executor/nodeCtescan.h
+++ b/src/include/executor/nodeCtescan.h
@@ -18,6 +18,7 @@
extern CteScanState *ExecInitCteScan(CteScan *node, EState *estate, int eflags);
extern TupleTableSlot *ExecCteScan(CteScanState *node);
+extern bool StartCteScan(CteScanState *node);
extern void ExecEndCteScan(CteScanState *node);
extern void ExecReScanCteScan(CteScanState *node);
diff --git a/src/include/executor/nodeGroup.h b/src/include/executor/nodeGroup.h
index 3485fe8..bfc75cd 100644
--- a/src/include/executor/nodeGroup.h
+++ b/src/include/executor/nodeGroup.h
@@ -18,6 +18,7 @@
extern GroupState *ExecInitGroup(Group *node, EState *estate, int eflags);
extern TupleTableSlot *ExecGroup(GroupState *node);
+extern bool StartGroup(GroupState *node);
extern void ExecEndGroup(GroupState *node);
extern void ExecReScanGroup(GroupState *node);
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index acc28438..b0855d3 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -19,6 +19,7 @@
extern HashState *ExecInitHash(Hash *node, EState *estate, int eflags);
extern TupleTableSlot *ExecHash(HashState *node);
extern Node *MultiExecHash(HashState *node);
+extern bool StartHash(HashState *node);
extern void ExecEndHash(HashState *node);
extern void ExecReScanHash(HashState *node);
diff --git a/src/include/executor/nodeHashjoin.h b/src/include/executor/nodeHashjoin.h
index c35a51c..826f639 100644
--- a/src/include/executor/nodeHashjoin.h
+++ b/src/include/executor/nodeHashjoin.h
@@ -19,6 +19,7 @@
extern HashJoinState *ExecInitHashJoin(HashJoin *node, EState *estate, int eflags);
extern TupleTableSlot *ExecHashJoin(HashJoinState *node);
+extern bool StartHashJoin(HashJoinState *node);
extern void ExecEndHashJoin(HashJoinState *node);
extern void ExecReScanHashJoin(HashJoinState *node);
diff --git a/src/include/executor/nodeLimit.h b/src/include/executor/nodeLimit.h
index 44f2936..5e8d2ea 100644
--- a/src/include/executor/nodeLimit.h
+++ b/src/include/executor/nodeLimit.h
@@ -18,6 +18,7 @@
extern LimitState *ExecInitLimit(Limit *node, EState *estate, int eflags);
extern TupleTableSlot *ExecLimit(LimitState *node);
+extern bool StartLimit(LimitState *node);
extern void ExecEndLimit(LimitState *node);
extern void ExecReScanLimit(LimitState *node);
diff --git a/src/include/executor/nodeLockRows.h b/src/include/executor/nodeLockRows.h
index 41764a1..c450233 100644
--- a/src/include/executor/nodeLockRows.h
+++ b/src/include/executor/nodeLockRows.h
@@ -18,6 +18,7 @@
extern LockRowsState *ExecInitLockRows(LockRows *node, EState *estate, int eflags);
extern TupleTableSlot *ExecLockRows(LockRowsState *node);
+extern bool StartLockRows(LockRowsState *node);
extern void ExecEndLockRows(LockRowsState *node);
extern void ExecReScanLockRows(LockRowsState *node);
diff --git a/src/include/executor/nodeMaterial.h b/src/include/executor/nodeMaterial.h
index cfb7a13..0392d29 100644
--- a/src/include/executor/nodeMaterial.h
+++ b/src/include/executor/nodeMaterial.h
@@ -18,6 +18,7 @@
extern MaterialState *ExecInitMaterial(Material *node, EState *estate, int eflags);
extern TupleTableSlot *ExecMaterial(MaterialState *node);
+extern bool StartMaterial(MaterialState *node);
extern void ExecEndMaterial(MaterialState *node);
extern void ExecMaterialMarkPos(MaterialState *node);
extern void ExecMaterialRestrPos(MaterialState *node);
diff --git a/src/include/executor/nodeMergeAppend.h b/src/include/executor/nodeMergeAppend.h
index 3c5068c..2f637dc 100644
--- a/src/include/executor/nodeMergeAppend.h
+++ b/src/include/executor/nodeMergeAppend.h
@@ -18,6 +18,7 @@
extern MergeAppendState *ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags);
extern TupleTableSlot *ExecMergeAppend(MergeAppendState *node);
+extern bool StartMergeAppend(MergeAppendState *node);
extern void ExecEndMergeAppend(MergeAppendState *node);
extern void ExecReScanMergeAppend(MergeAppendState *node);
diff --git a/src/include/executor/nodeMergejoin.h b/src/include/executor/nodeMergejoin.h
index bee5367..ead6898 100644
--- a/src/include/executor/nodeMergejoin.h
+++ b/src/include/executor/nodeMergejoin.h
@@ -18,6 +18,7 @@
extern MergeJoinState *ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags);
extern TupleTableSlot *ExecMergeJoin(MergeJoinState *node);
+extern bool StartMergeJoin(MergeJoinState *node);
extern void ExecEndMergeJoin(MergeJoinState *node);
extern void ExecReScanMergeJoin(MergeJoinState *node);
diff --git a/src/include/executor/nodeNestloop.h b/src/include/executor/nodeNestloop.h
index ff0720f..f79a002 100644
--- a/src/include/executor/nodeNestloop.h
+++ b/src/include/executor/nodeNestloop.h
@@ -18,6 +18,7 @@
extern NestLoopState *ExecInitNestLoop(NestLoop *node, EState *estate, int eflags);
extern TupleTableSlot *ExecNestLoop(NestLoopState *node);
+extern bool StartNestLoop(NestLoopState *node);
extern void ExecEndNestLoop(NestLoopState *node);
extern void ExecReScanNestLoop(NestLoopState *node);
diff --git a/src/include/executor/nodeResult.h b/src/include/executor/nodeResult.h
index 17a7bb6..84b375d 100644
--- a/src/include/executor/nodeResult.h
+++ b/src/include/executor/nodeResult.h
@@ -18,6 +18,7 @@
extern ResultState *ExecInitResult(Result *node, EState *estate, int eflags);
extern TupleTableSlot *ExecResult(ResultState *node);
+extern bool StartResult(ResultState *node);
extern void ExecEndResult(ResultState *node);
extern void ExecResultMarkPos(ResultState *node);
extern void ExecResultRestrPos(ResultState *node);
diff --git a/src/include/executor/nodeSetOp.h b/src/include/executor/nodeSetOp.h
index ed6c96a..f960dda 100644
--- a/src/include/executor/nodeSetOp.h
+++ b/src/include/executor/nodeSetOp.h
@@ -18,6 +18,7 @@
extern SetOpState *ExecInitSetOp(SetOp *node, EState *estate, int eflags);
extern TupleTableSlot *ExecSetOp(SetOpState *node);
+extern bool StartSetOp(SetOpState *node);
extern void ExecEndSetOp(SetOpState *node);
extern void ExecReScanSetOp(SetOpState *node);
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index 20d909b..0c6d12d 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -18,6 +18,7 @@
extern SortState *ExecInitSort(Sort *node, EState *estate, int eflags);
extern TupleTableSlot *ExecSort(SortState *node);
+extern bool StartSort(SortState *node);
extern void ExecEndSort(SortState *node);
extern void ExecSortMarkPos(SortState *node);
extern void ExecSortRestrPos(SortState *node);
diff --git a/src/include/executor/nodeSubqueryscan.h b/src/include/executor/nodeSubqueryscan.h
index 56e3aec..0301edd 100644
--- a/src/include/executor/nodeSubqueryscan.h
+++ b/src/include/executor/nodeSubqueryscan.h
@@ -18,6 +18,7 @@
extern SubqueryScanState *ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags);
extern TupleTableSlot *ExecSubqueryScan(SubqueryScanState *node);
+extern bool StartSubqueryScan(SubqueryScanState *node);
extern void ExecEndSubqueryScan(SubqueryScanState *node);
extern void ExecReScanSubqueryScan(SubqueryScanState *node);
diff --git a/src/include/executor/nodeUnique.h b/src/include/executor/nodeUnique.h
index ec2df59..76727aa 100644
--- a/src/include/executor/nodeUnique.h
+++ b/src/include/executor/nodeUnique.h
@@ -18,6 +18,7 @@
extern UniqueState *ExecInitUnique(Unique *node, EState *estate, int eflags);
extern TupleTableSlot *ExecUnique(UniqueState *node);
+extern bool StartUnique(UniqueState *node);
extern void ExecEndUnique(UniqueState *node);
extern void ExecReScanUnique(UniqueState *node);
diff --git a/src/include/executor/nodeWindowAgg.h b/src/include/executor/nodeWindowAgg.h
index 8a7b1fa..e9699b0 100644
--- a/src/include/executor/nodeWindowAgg.h
+++ b/src/include/executor/nodeWindowAgg.h
@@ -18,6 +18,7 @@
extern WindowAggState *ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags);
extern TupleTableSlot *ExecWindowAgg(WindowAggState *node);
+extern bool StartWindowAgg(WindowAggState *node);
extern void ExecEndWindowAgg(WindowAggState *node);
extern void ExecReScanWindowAgg(WindowAggState *node);
--
1.8.3.1
0004-Add-StartForeignScan-to-FdwRoutine.patch (text/x-patch; charset=us-ascii)
>From 096769d7670713f7b7c91920be84a04392b9846b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 9 Jul 2015 20:21:14 +0900
Subject: [PATCH 4/6] Add StartForeignScan to FdwRoutine
Add a new entry point, StartForeignScan, to FdwRoutine so that an FDW
scan can be started using the asynchronous execution infrastructure.
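
For illustration, here is a minimal sketch, assuming this patch series is
applied, of how a wrapper's handler could fill in the new callback.  The
names my_fdw_handler and my_fdw_start_foreign_scan are hypothetical, and
the callback may be left NULL since StartForeignScan checks for that.

/* hypothetical FDW handler wiring up the new StartForeignScan callback */
#include "postgres.h"
#include "fmgr.h"
#include "foreign/fdwapi.h"

PG_FUNCTION_INFO_V1(my_fdw_handler);

static bool
my_fdw_start_foreign_scan(ForeignScanState *node)
{
	/*
	 * Kick off remote work here without waiting for the first tuple
	 * request, and return true only if something was really started
	 * asynchronously.
	 */
	return false;				/* this sketch starts nothing */
}

Datum
my_fdw_handler(PG_FUNCTION_ARGS)
{
	FdwRoutine *routine = makeNode(FdwRoutine);

	/* ... the mandatory scan callbacks would be set here ... */
	routine->StartForeignScan = my_fdw_start_foreign_scan;

	PG_RETURN_POINTER(routine);
}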
---
src/backend/executor/execProcnode.c | 4 +++-
src/backend/executor/nodeForeignscan.c | 24 ++++++++++++++++++++++++
src/include/executor/nodeForeignscan.h | 1 +
src/include/foreign/fdwapi.h | 3 +++
4 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 2b282ca..6d7ad51 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -848,8 +848,10 @@ StartProcNode(PlanState *node)
case T_LimitState:
return StartLimit((LimitState *)node);
- /* These nodes cannot run asynchronously */
case T_ForeignScanState:
+ return StartForeignScan((ForeignScanState *)node);
+
+ /* These nodes cannot run asynchronously */
case T_WorkTableScanState:
case T_CustomScanState:
case T_FunctionScanState:
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 5d39c85..de19bf9 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -101,6 +101,30 @@ ExecForeignScan(ForeignScanState *node)
return slot;
}
+/* ----------------------------------------------------------------
+ * StartForeignScan
+ *
+ * Try to start asynchronously.
+ * Returns true if any of underlying nodes started asynchronously
+ * ----------------------------------------------------------------
+ */
+bool
+StartForeignScan(ForeignScanState *node)
+{
+	StartForeignScan_function StartForeignScanFunc =
+ node->fdwroutine->StartForeignScan;
+
+ if (!ExecNode_is(node, Inited))
+ return false;
+
+ if (StartForeignScanFunc && StartForeignScanFunc(node))
+ {
+ SetNodeRunState(node, Started);
+ return true;
+ }
+
+ return false;
+}
/* ----------------------------------------------------------------
* ExecInitForeignScan
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 45e0e9c..39802a3 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -18,6 +18,7 @@
extern ForeignScanState *ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags);
extern TupleTableSlot *ExecForeignScan(ForeignScanState *node);
+extern bool StartForeignScan(ForeignScanState *node);
extern void ExecEndForeignScan(ForeignScanState *node);
extern void ExecReScanForeignScan(ForeignScanState *node);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 69b48b4..9f4234a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -118,6 +118,8 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
Oid serverOid);
+typedef bool (*StartForeignScan_function) (ForeignScanState *node);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -140,6 +142,7 @@ typedef struct FdwRoutine
IterateForeignScan_function IterateForeignScan;
ReScanForeignScan_function ReScanForeignScan;
EndForeignScan_function EndForeignScan;
+	StartForeignScan_function StartForeignScan;
/*
* Remaining functions are optional. Set the pointer to NULL for any that
--
1.8.3.1
0005-Allow-asynchronous-remote-query-of-postgres_fdw.patch (text/x-patch; charset=us-ascii)
>From c3aec50bccc28cdc9376136d2fe8cd9865ba326a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 26 Jun 2015 16:54:39 +0900
Subject: [PATCH 5/6] Allow asynchronous remote query of postgres_fdw.
The new type PgFdwConn makes connection.c aware of a running
asynchronous query.
The scan node that starts first on a connection (foreign server) is
kicked off in the StartProcNode phase if the server or foreign table
option "allow_async" is true (the default).
Apart from that initial start, every successive fetch for the same
foreign scan is issued asynchronously regardless of the allow_async
setting.
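
As background, and not part of the patch itself, PgFdwConn ultimately
drives libpq's standard asynchronous query API.  Below is a standalone
sketch of that flow with a placeholder connection string; a real caller
would wait on the socket instead of spinning on PQisBusy.

/* standalone sketch of the libpq async flow that PgFdwConn wraps */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");	/* placeholder */
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* send the query without blocking for its result */
	if (!PQsendQuery(conn, "SELECT 42"))
	{
		fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* other work could happen here; poll until the result is ready */
	do
	{
		PQconsumeInput(conn);
	} while (PQisBusy(conn));

	/* collect every result; PQgetResult returns NULL when the query is done */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			printf("got %d row(s)\n", PQntuples(res));
		PQclear(res);
	}

	PQfinish(conn);
	return 0;
}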
---
contrib/postgres_fdw/Makefile | 2 +-
contrib/postgres_fdw/PgFdwConn.c | 200 +++++++++++++++++++++
contrib/postgres_fdw/PgFdwConn.h | 61 +++++++
contrib/postgres_fdw/connection.c | 81 +++++----
contrib/postgres_fdw/option.c | 3 +
contrib/postgres_fdw/postgres_fdw.c | 340 ++++++++++++++++++++++++++++--------
contrib/postgres_fdw/postgres_fdw.h | 15 +-
7 files changed, 587 insertions(+), 115 deletions(-)
create mode 100644 contrib/postgres_fdw/PgFdwConn.c
create mode 100644 contrib/postgres_fdw/PgFdwConn.h
diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index d2b98e1..d0913e2 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -1,7 +1,7 @@
# contrib/postgres_fdw/Makefile
MODULE_big = postgres_fdw
-OBJS = postgres_fdw.o option.o deparse.o connection.o $(WIN32RES)
+OBJS = postgres_fdw.o PgFdwConn.o option.o deparse.o connection.o $(WIN32RES)
PGFILEDESC = "postgres_fdw - foreign data wrapper for PostgreSQL"
PG_CPPFLAGS = -I$(libpq_srcdir)
diff --git a/contrib/postgres_fdw/PgFdwConn.c b/contrib/postgres_fdw/PgFdwConn.c
new file mode 100644
index 0000000..b13b597
--- /dev/null
+++ b/contrib/postgres_fdw/PgFdwConn.c
@@ -0,0 +1,200 @@
+/*-------------------------------------------------------------------------
+ *
+ * PgFdwConn.c
+ * PGconn extending wrapper to enable asynchronous query.
+ *
+ * Portions Copyright (c) 2012-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/postgres_fdw/PgFdwConn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "PgFdwConn.h"
+
+#define PFC_ALLOCATE() ((PgFdwConn *)malloc(sizeof(PgFdwConn)))
+#define PFC_FREE(c) free(c)
+
+struct pgfdw_conn
+{
+ PGconn *pgconn; /* libpq connection for this connection */
+ int nscans; /* number of scans using this connection */
+	struct PgFdwScanState *async_scan; /* the scan currently running an
+										* async query on this connection */
+};
+
+void
+PFCsetAsyncScan(PgFdwConn *conn, struct PgFdwScanState *scan)
+{
+ conn->async_scan = scan;
+}
+
+struct PgFdwScanState *
+PFCgetAsyncScan(PgFdwConn *conn)
+{
+ return conn->async_scan;
+}
+
+int
+PFCisAsyncRunning(PgFdwConn *conn)
+{
+ return conn->async_scan != NULL;
+}
+
+PGconn *
+PFCgetPGconn(PgFdwConn *conn)
+{
+ return conn->pgconn;
+}
+
+int
+PFCgetNscans(PgFdwConn *conn)
+{
+ return conn->nscans;
+}
+
+int
+PFCincrementNscans(PgFdwConn *conn)
+{
+ return ++conn->nscans;
+}
+
+int
+PFCdecrementNscans(PgFdwConn *conn)
+{
+ Assert(conn->nscans > 0);
+ return --conn->nscans;
+}
+
+void
+PFCcancelAsync(PgFdwConn *conn)
+{
+ if (PFCisAsyncRunning(conn))
+ PFCconsumeInput(conn);
+}
+
+void
+PFCinit(PgFdwConn *conn)
+{
+ conn->async_scan = NULL;
+ conn->nscans = 0;
+}
+
+int
+PFCsendQuery(PgFdwConn *conn, const char *query)
+{
+ return PQsendQuery(conn->pgconn, query);
+}
+
+PGresult *
+PFCexec(PgFdwConn *conn, const char *query)
+{
+ return PQexec(conn->pgconn, query);
+}
+
+PGresult *
+PFCexecParams(PgFdwConn *conn,
+ const char *command,
+ int nParams,
+ const Oid *paramTypes,
+ const char *const * paramValues,
+ const int *paramLengths,
+ const int *paramFormats,
+ int resultFormat)
+{
+ return PQexecParams(conn->pgconn,
+ command, nParams, paramTypes, paramValues,
+ paramLengths, paramFormats, resultFormat);
+}
+
+PGresult *
+PFCprepare(PgFdwConn *conn,
+ const char *stmtName, const char *query,
+ int nParams, const Oid *paramTypes)
+{
+ return PQprepare(conn->pgconn, stmtName, query, nParams, paramTypes);
+}
+
+PGresult *
+PFCexecPrepared(PgFdwConn *conn,
+ const char *stmtName,
+ int nParams,
+ const char *const * paramValues,
+ const int *paramLengths,
+ const int *paramFormats,
+ int resultFormat)
+{
+ return PQexecPrepared(conn->pgconn,
+ stmtName, nParams, paramValues, paramLengths,
+ paramFormats, resultFormat);
+}
+
+PGresult *
+PFCgetResult(PgFdwConn *conn)
+{
+ return PQgetResult(conn->pgconn);
+}
+
+int
+PFCconsumeInput(PgFdwConn *conn)
+{
+ return PQconsumeInput(conn->pgconn);
+}
+
+int
+PFCisBusy(PgFdwConn *conn)
+{
+ return PQisBusy(conn->pgconn);
+}
+
+ConnStatusType
+PFCstatus(const PgFdwConn *conn)
+{
+ return PQstatus(conn->pgconn);
+}
+
+PGTransactionStatusType
+PFCtransactionStatus(const PgFdwConn *conn)
+{
+ return PQtransactionStatus(conn->pgconn);
+}
+
+int
+PFCserverVersion(const PgFdwConn *conn)
+{
+ return PQserverVersion(conn->pgconn);
+}
+
+char *
+PFCerrorMessage(const PgFdwConn *conn)
+{
+ return PQerrorMessage(conn->pgconn);
+}
+
+int
+PFCconnectionUsedPassword(const PgFdwConn *conn)
+{
+ return PQconnectionUsedPassword(conn->pgconn);
+}
+
+void
+PFCfinish(PgFdwConn *conn)
+{
+	PQfinish(conn->pgconn);
+ PFC_FREE(conn);
+}
+
+PgFdwConn *
+PFCconnectdbParams(const char *const * keywords,
+ const char *const * values, int expand_dbname)
+{
+ PgFdwConn *ret = PFC_ALLOCATE();
+
+ PFCinit(ret);
+ ret->pgconn = PQconnectdbParams(keywords, values, expand_dbname);
+
+ return ret;
+}
diff --git a/contrib/postgres_fdw/PgFdwConn.h b/contrib/postgres_fdw/PgFdwConn.h
new file mode 100644
index 0000000..f695f5a
--- /dev/null
+++ b/contrib/postgres_fdw/PgFdwConn.h
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * PgFdwConn.h
+ * PGconn extending wrapper to enable asynchronous query.
+ *
+ * Portions Copyright (c) 2012-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/postgres_fdw/PgFdwConn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PGFDWCONN_H
+#define PGFDWCONN_H
+
+#include "libpq-fe.h"
+
+typedef struct pgfdw_conn PgFdwConn;
+struct PgFdwScanState;
+
+extern void PFCsetAsyncScan(PgFdwConn *conn, struct PgFdwScanState *scan);
+extern struct PgFdwScanState *PFCgetAsyncScan(PgFdwConn *conn);
+extern int PFCisAsyncRunning(PgFdwConn *conn);
+extern PGconn *PFCgetPGconn(PgFdwConn *conn);
+extern int PFCgetNscans(PgFdwConn *conn);
+extern int PFCincrementNscans(PgFdwConn *conn);
+extern int PFCdecrementNscans(PgFdwConn *conn);
+extern void PFCcancelAsync(PgFdwConn *conn);
+extern void PFCinit(PgFdwConn *conn);
+extern int PFCsendQuery(PgFdwConn *conn, const char *query);
+extern PGresult *PFCexec(PgFdwConn *conn, const char *query);
+extern PGresult *PFCexecParams(PgFdwConn *conn,
+ const char *command,
+ int nParams,
+ const Oid *paramTypes,
+ const char *const * paramValues,
+ const int *paramLengths,
+ const int *paramFormats,
+ int resultFormat);
+extern PGresult *PFCprepare(PgFdwConn *conn,
+ const char *stmtName, const char *query,
+ int nParams, const Oid *paramTypes);
+extern PGresult *PFCexecPrepared(PgFdwConn *conn,
+ const char *stmtName,
+ int nParams,
+ const char *const * paramValues,
+ const int *paramLengths,
+ const int *paramFormats,
+ int resultFormat);
+extern PGresult *PFCgetResult(PgFdwConn *conn);
+extern int PFCconsumeInput(PgFdwConn *conn);
+extern int PFCisBusy(PgFdwConn *conn);
+extern ConnStatusType PFCstatus(const PgFdwConn *conn);
+extern PGTransactionStatusType PFCtransactionStatus(const PgFdwConn *conn);
+extern int PFCserverVersion(const PgFdwConn *conn);
+extern char *PFCerrorMessage(const PgFdwConn *conn);
+extern int PFCconnectionUsedPassword(const PgFdwConn *conn);
+extern void PFCfinish(PgFdwConn *conn);
+extern PgFdwConn *PFCconnectdbParams(const char *const * keywords,
+ const char *const * values, int expand_dbname);
+#endif /* PGFDWCONN_H */
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 1a1e5b5..790b675 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -44,7 +44,7 @@ typedef struct ConnCacheKey
typedef struct ConnCacheEntry
{
ConnCacheKey key; /* hash key (must be first) */
- PGconn *conn; /* connection to foreign server, or NULL */
+ PgFdwConn *conn; /* connection to foreign server, or NULL */
int xact_depth; /* 0 = no xact open, 1 = main xact open, 2 =
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
@@ -64,10 +64,10 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PgFdwConn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
-static void configure_remote_session(PGconn *conn);
-static void do_sql_command(PGconn *conn, const char *sql);
+static void configure_remote_session(PgFdwConn *conn);
+static void do_sql_command(PgFdwConn *conn, const char *sql);
static void begin_remote_xact(ConnCacheEntry *entry);
static void pgfdw_xact_callback(XactEvent event, void *arg);
static void pgfdw_subxact_callback(SubXactEvent event,
@@ -93,7 +93,7 @@ static void pgfdw_subxact_callback(SubXactEvent event,
* be useful and not mere pedantry. We could not flush any active connections
* mid-transaction anyway.
*/
-PGconn *
+PgFdwConn *
GetConnection(ForeignServer *server, UserMapping *user,
bool will_prep_stmt)
{
@@ -161,9 +161,11 @@ GetConnection(ForeignServer *server, UserMapping *user,
entry->have_error = false;
entry->conn = connect_pg_server(server, user);
elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\"",
- entry->conn, server->servername);
+ PFCgetPGconn(entry->conn), server->servername);
}
+ PFCincrementNscans(entry->conn);
+
/*
* Start a new transaction or subtransaction if needed.
*/
@@ -178,10 +180,10 @@ GetConnection(ForeignServer *server, UserMapping *user,
/*
* Connect to remote server using specified server and user mapping properties.
*/
-static PGconn *
+static PgFdwConn *
connect_pg_server(ForeignServer *server, UserMapping *user)
{
- PGconn *volatile conn = NULL;
+ PgFdwConn *volatile conn = NULL;
/*
* Use PG_TRY block to ensure closing connection on error.
@@ -223,14 +225,14 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
/* verify connection parameters and make connection */
check_conn_params(keywords, values);
- conn = PQconnectdbParams(keywords, values, false);
- if (!conn || PQstatus(conn) != CONNECTION_OK)
+ conn = PFCconnectdbParams(keywords, values, false);
+ if (!conn || PFCstatus(conn) != CONNECTION_OK)
{
char *connmessage;
int msglen;
/* libpq typically appends a newline, strip that */
- connmessage = pstrdup(PQerrorMessage(conn));
+ connmessage = pstrdup(PFCerrorMessage(conn));
msglen = strlen(connmessage);
if (msglen > 0 && connmessage[msglen - 1] == '\n')
connmessage[msglen - 1] = '\0';
@@ -246,7 +248,7 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
* otherwise, he's piggybacking on the postgres server's user
* identity. See also dblink_security_check() in contrib/dblink.
*/
- if (!superuser() && !PQconnectionUsedPassword(conn))
+ if (!superuser() && !PFCconnectionUsedPassword(conn))
ereport(ERROR,
(errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
errmsg("password is required"),
@@ -263,7 +265,7 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
{
/* Release PGconn data structure if we managed to create one */
if (conn)
- PQfinish(conn);
+ PFCfinish(conn);
PG_RE_THROW();
}
PG_END_TRY();
@@ -312,9 +314,9 @@ check_conn_params(const char **keywords, const char **values)
* there are any number of ways to break things.
*/
static void
-configure_remote_session(PGconn *conn)
+configure_remote_session(PgFdwConn *conn)
{
- int remoteversion = PQserverVersion(conn);
+ int remoteversion = PFCserverVersion(conn);
/* Force the search path to contain only pg_catalog (see deparse.c) */
do_sql_command(conn, "SET search_path = pg_catalog");
@@ -348,11 +350,11 @@ configure_remote_session(PGconn *conn)
* Convenience subroutine to issue a non-data-returning SQL command to remote
*/
static void
-do_sql_command(PGconn *conn, const char *sql)
+do_sql_command(PgFdwConn *conn, const char *sql)
{
PGresult *res;
- res = PQexec(conn, sql);
+ res = PFCexec(conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, true, sql);
PQclear(res);
@@ -379,7 +381,7 @@ begin_remote_xact(ConnCacheEntry *entry)
const char *sql;
elog(DEBUG3, "starting remote transaction on connection %p",
- entry->conn);
+ PFCgetPGconn(entry->conn));
if (IsolationIsSerializable())
sql = "START TRANSACTION ISOLATION LEVEL SERIALIZABLE";
@@ -408,13 +410,11 @@ begin_remote_xact(ConnCacheEntry *entry)
* Release connection reference count created by calling GetConnection.
*/
void
-ReleaseConnection(PGconn *conn)
+ReleaseConnection(PgFdwConn *conn)
{
- /*
- * Currently, we don't actually track connection references because all
- * cleanup is managed on a transaction or subtransaction basis instead. So
- * there's nothing to do here.
- */
+ /* ongoing async query should be canceled if no scans left */
+ if (PFCdecrementNscans(conn) == 0)
+ finish_async_query(conn);
}
/*
@@ -429,7 +429,7 @@ ReleaseConnection(PGconn *conn)
* collisions are highly improbable; just be sure to use %u not %d to print.
*/
unsigned int
-GetCursorNumber(PGconn *conn)
+GetCursorNumber(PgFdwConn *conn)
{
return ++cursor_number;
}
@@ -443,7 +443,7 @@ GetCursorNumber(PGconn *conn)
* increasing the risk of prepared-statement name collisions by resetting.
*/
unsigned int
-GetPrepStmtNumber(PGconn *conn)
+GetPrepStmtNumber(PgFdwConn *conn)
{
return ++prep_stmt_number;
}
@@ -462,7 +462,7 @@ GetPrepStmtNumber(PGconn *conn)
* marked with have_error = true.
*/
void
-pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
+pgfdw_report_error(int elevel, PGresult *res, PgFdwConn *conn,
bool clear, const char *sql)
{
/* If requested, PGresult must be released before leaving this function. */
@@ -490,7 +490,7 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
* return NULL, not a PGresult at all.
*/
if (message_primary == NULL)
- message_primary = PQerrorMessage(conn);
+ message_primary = PFCerrorMessage(conn);
ereport(elevel,
(errcode(sqlstate),
@@ -542,7 +542,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
if (entry->xact_depth > 0)
{
elog(DEBUG3, "closing remote transaction on connection %p",
- entry->conn);
+ PFCgetPGconn(entry->conn));
switch (event)
{
@@ -568,7 +568,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
*/
if (entry->have_prep_stmt && entry->have_error)
{
- res = PQexec(entry->conn, "DEALLOCATE ALL");
+ res = PFCexec(entry->conn, "DEALLOCATE ALL");
PQclear(res);
}
entry->have_prep_stmt = false;
@@ -600,7 +600,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
/* Assume we might have lost track of prepared statements */
entry->have_error = true;
/* If we're aborting, abort all remote transactions too */
- res = PQexec(entry->conn, "ABORT TRANSACTION");
+ res = PFCexec(entry->conn, "ABORT TRANSACTION");
/* Note: can't throw ERROR, it would be infinite loop */
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(WARNING, res, entry->conn, true,
@@ -611,7 +611,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
/* As above, make sure to clear any prepared stmts */
if (entry->have_prep_stmt && entry->have_error)
{
- res = PQexec(entry->conn, "DEALLOCATE ALL");
+ res = PFCexec(entry->conn, "DEALLOCATE ALL");
PQclear(res);
}
entry->have_prep_stmt = false;
@@ -623,17 +623,20 @@
/* Reset state to show we're out of a transaction */
entry->xact_depth = 0;
+ PFCcancelAsync(entry->conn);
+ PFCinit(entry->conn);
/*
* If the connection isn't in a good idle state, discard it to
* recover. Next GetConnection will open a new connection.
*/
- if (PQstatus(entry->conn) != CONNECTION_OK ||
- PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+ if (PFCstatus(entry->conn) != CONNECTION_OK ||
+ PFCtransactionStatus(entry->conn) != PQTRANS_IDLE)
{
- elog(DEBUG3, "discarding connection %p", entry->conn);
- PQfinish(entry->conn);
- entry->conn = NULL;
+ elog(DEBUG3, "discarding connection %p",
+ PFCgetPGconn(entry->conn));
+			PFCfinish(entry->conn);
+			entry->conn = NULL;
}
}
@@ -679,6 +681,9 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
PGresult *res;
char sql[100];
+ /* Shut down asynchronous scan if running */
+ PFCcancelAsync(entry->conn);
+
/*
* We only care about connections with open remote subtransactions of
* the current level.
@@ -704,7 +709,7 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
snprintf(sql, sizeof(sql),
"ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d",
curlevel, curlevel);
- res = PQexec(entry->conn, sql);
+ res = PFCexec(entry->conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(WARNING, res, entry->conn, true, sql);
else
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 7547ec2..07977d9 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -153,6 +153,9 @@ InitPgFdwOptions(void)
/* updatable is available on both server and table */
{"updatable", ForeignServerRelationId, false},
{"updatable", ForeignTableRelationId, false},
+ /* async execution */
+ {"allow_async", ForeignServerRelationId, false},
+ {"allow_async", ForeignTableRelationId, false},
{NULL, InvalidOid, false}
};
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e4d799c..dc60bcc 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -22,6 +22,7 @@
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
+#include "nodes/execnodes.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/cost.h"
@@ -124,6 +125,13 @@ enum FdwModifyPrivateIndex
FdwModifyPrivateRetrievedAttrs
};
+typedef enum fetch_mode {
+ START_ONLY,
+ FORCE_SYNC,
+ ALLOW_ASYNC,
+ EXIT_ASYNC
+} fetch_mode;
+
/*
* Execution state of a foreign scan using postgres_fdw.
*/
@@ -137,7 +145,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ PgFdwConn *conn; /* connection for the scan */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -157,6 +165,9 @@ typedef struct PgFdwScanState
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
+ ExprContext *econtext; /* copy of ps_ExprContext of ForeignScanState */
+
+ bool allow_async; /* true if async execution is allowed. */
} PgFdwScanState;
/*
@@ -168,7 +179,7 @@ typedef struct PgFdwModifyState
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ PgFdwConn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -250,6 +261,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static bool postgresStartForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -299,7 +311,7 @@ static void estimate_path_cost_size(PlannerInfo *root,
double *p_rows, int *p_width,
Cost *p_startup_cost, Cost *p_total_cost);
static void get_remote_estimate(const char *sql,
- PGconn *conn,
+ PgFdwConn *conn,
double *rows,
int *width,
Cost *startup_cost,
@@ -307,9 +319,9 @@ static void get_remote_estimate(const char *sql,
static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
-static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
-static void close_cursor(PGconn *conn, unsigned int cursor_number);
+static void create_cursor(PgFdwScanState *fsstate);
+static void fetch_more_data(PgFdwScanState *node, fetch_mode cmd);
+static void close_cursor(PgFdwConn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
ItemPointer tupleid,
@@ -348,6 +360,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->StartForeignScan = postgresStartForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -373,6 +386,37 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
}
/*
+ * Read boolean server/table options
+ * 0 is false, 1 is true, -1 is not specified
+ * table options override server options.
+ */
+static int
+postgresGetOptionBoolean(ForeignServer *server, ForeignTable *table,
+ char *optname)
+{
+ ListCell *lc;
+ int val = -1;
+
+ foreach(lc, table->options)
+ {
+ DefElem *def = (DefElem *) lfirst(lc);
+
+ if (strcmp(def->defname, optname) == 0)
+ val = defGetBoolean(def);
+ }
+ if (val >= 0) return val;
+
+ foreach(lc, server->options)
+ {
+ DefElem *def = (DefElem *) lfirst(lc);
+
+ if (strcmp(def->defname, optname) == 0)
+ val = defGetBoolean(def);
+ }
+ return val;
+}
+
+/*
* postgresGetForeignRelSize
* Estimate # of rows and width of the result of the scan
*
@@ -877,6 +921,34 @@ postgresGetForeignPlan(PlannerInfo *root,
NIL /* no custom tlist */ );
}
+/* Asynchronous execution function */
+static bool
+postgresStartForeignScan(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = (PgFdwScanState *)node->fdw_state;
+
+ if (!fsstate->allow_async)
+ return false;
+
+ /*
+ * In the current implementation, a scan can run asynchronously if it is the
+ * first scan on its connection.
+ */
+ if (!PFCgetAsyncScan(fsstate->conn))
+ {
+ create_cursor(fsstate);
+
+ /*
+ * Start async scan if this is the first scan. See fetch_more_data()
+ * for details
+ */
+ fetch_more_data(fsstate, START_ONLY);
+ return true;
+ }
+
+ return false;
+}
+
/*
* postgresBeginForeignScan
* Initiate an executor scan of a foreign PostgreSQL table.
@@ -931,6 +1003,12 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
fsstate->cursor_exists = false;
+ /* Get async execution option */
+ fsstate->allow_async =
+ postgresGetOptionBoolean(server, table, "allow_async");
+ if (fsstate->allow_async < 0)
+ fsstate->allow_async = 1; /* Default is true */
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -988,6 +1066,8 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
fsstate->param_values = (const char **) palloc0(numParams * sizeof(char *));
else
fsstate->param_values = NULL;
+
+ fsstate->econtext = node->ss.ps.ps_ExprContext;
}
/*
@@ -1006,7 +1086,10 @@ postgresIterateForeignScan(ForeignScanState *node)
* cursor on the remote side.
*/
if (!fsstate->cursor_exists)
- create_cursor(node);
+ {
+ finish_async_query(fsstate->conn);
+ create_cursor(fsstate);
+ }
/*
* Get some more tuples, if we've run out.
@@ -1015,7 +1098,7 @@ postgresIterateForeignScan(ForeignScanState *node)
{
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
- fetch_more_data(node);
+ fetch_more_data(fsstate, ALLOW_ASYNC);
/* If we didn't get any tuples, must be end of data. */
if (fsstate->next_tuple >= fsstate->num_tuples)
return ExecClearTuple(slot);
@@ -1075,7 +1158,7 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexec(fsstate->conn, sql);
+ res = PFCexec(fsstate->conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
PQclear(res);
@@ -1411,19 +1494,22 @@ postgresExecForeignInsert(EState *estate,
/* Convert parameters needed by prepared statement to text form */
p_values = convert_prep_stmt_params(fmstate, NULL, slot);
+ /* Finish async query if running */
+ finish_async_query(fmstate->conn);
+
/*
* Execute the prepared statement, and check for success.
*
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexecPrepared(fmstate->conn,
- fmstate->p_name,
- fmstate->p_nums,
- p_values,
- NULL,
- NULL,
- 0);
+ res = PFCexecPrepared(fmstate->conn,
+ fmstate->p_name,
+ fmstate->p_nums,
+ p_values,
+ NULL,
+ NULL,
+ 0);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
@@ -1481,19 +1567,22 @@ postgresExecForeignUpdate(EState *estate,
(ItemPointer) DatumGetPointer(datum),
slot);
+ /* Finish async query if running */
+ finish_async_query(fmstate->conn);
+
/*
* Execute the prepared statement, and check for success.
*
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexecPrepared(fmstate->conn,
- fmstate->p_name,
- fmstate->p_nums,
- p_values,
- NULL,
- NULL,
- 0);
+ res = PFCexecPrepared(fmstate->conn,
+ fmstate->p_name,
+ fmstate->p_nums,
+ p_values,
+ NULL,
+ NULL,
+ 0);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
@@ -1551,19 +1640,22 @@ postgresExecForeignDelete(EState *estate,
(ItemPointer) DatumGetPointer(datum),
NULL);
+ /* Finish async query if running */
+ finish_async_query(fmstate->conn);
+
/*
* Execute the prepared statement, and check for success.
*
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexecPrepared(fmstate->conn,
- fmstate->p_name,
- fmstate->p_nums,
- p_values,
- NULL,
- NULL,
- 0);
+ res = PFCexecPrepared(fmstate->conn,
+ fmstate->p_name,
+ fmstate->p_nums,
+ p_values,
+ NULL,
+ NULL,
+ 0);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
@@ -1613,7 +1705,7 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexec(fmstate->conn, sql);
+ res = PFCexec(fmstate->conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
PQclear(res);
@@ -1745,7 +1837,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_join_conds;
StringInfoData sql;
List *retrieved_attrs;
- PGconn *conn;
+ PgFdwConn *conn;
Selectivity local_sel;
QualCost local_cost;
@@ -1855,7 +1947,7 @@ estimate_path_cost_size(PlannerInfo *root,
* The given "sql" must be an EXPLAIN command.
*/
static void
-get_remote_estimate(const char *sql, PGconn *conn,
+get_remote_estimate(const char *sql, PgFdwConn *conn,
double *rows, int *width,
Cost *startup_cost, Cost *total_cost)
{
@@ -1871,7 +1963,7 @@ get_remote_estimate(const char *sql, PGconn *conn,
/*
* Execute EXPLAIN remotely.
*/
- res = PQexec(conn, sql);
+ res = PFCexec(conn, sql);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql);
@@ -1936,13 +2028,12 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
* Create cursor for node's query with current parameter values.
*/
static void
-create_cursor(ForeignScanState *node)
+create_cursor(PgFdwScanState *fsstate)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
- ExprContext *econtext = node->ss.ps.ps_ExprContext;
+ ExprContext *econtext = fsstate->econtext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PgFdwConn *conn = fsstate->conn;
StringInfoData buf;
PGresult *res;
@@ -2004,8 +2095,8 @@ create_cursor(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexecParams(conn, buf.data, numParams, NULL, values,
- NULL, NULL, 0);
+ res = PFCexecParams(conn, buf.data, numParams, NULL, values,
+ NULL, NULL, 0);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, true, fsstate->query);
PQclear(res);
@@ -2026,54 +2117,121 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
PGresult *volatile res = NULL;
MemoryContext oldcontext;
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch for the case other than exiting from async mode.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (cmd != EXIT_ASYNC)
+ {
+ fsstate->tuples = NULL;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PgFdwConn *conn = fsstate->conn;
char sql[64];
int fetch_size;
- int numrows;
+ int numrows, addrows, restrows;
+ HeapTuple *tmptuples;
int i;
+ int fetch_buf_size;
/* The fetch size is arbitrary, but shouldn't be enormous. */
fetch_size = 100;
+ /* Make the query to fetch tuples */
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fetch_size, fsstate->cursor_number);
- res = PQexec(conn, sql);
+ if (PFCisAsyncRunning(conn))
+ {
+ Assert (cmd != START_ONLY);
+
+ /*
+ * If the target fsstate is different from the scan state that the
+ * current async fetch running for, the result should be stored
+ * into it, then synchronously fetch data for the target fsstate.
+ */
+ if (fsstate != PFCgetAsyncScan(conn))
+ {
+ fetch_more_data(PFCgetAsyncScan(conn), EXIT_ASYNC);
+ res = PFCexec(conn, sql);
+ }
+ else
+ {
+ /* Get result of running async fetch */
+ res = PFCgetResult(conn);
+ if (PQntuples(res) == fetch_size)
+ {
+ /*
+ * Connection state doesn't go to IDLE even if all data
+ * has been sent to client for asynchronous query. One
+ * more PQgetResult() is needed to reset the state to
+ * IDLE. See PQexecFinish() for details.
+ */
+ if (PFCgetResult(conn) != NULL)
+ elog(ERROR, "Connection status error.");
+ }
+ }
+ PFCsetAsyncScan(conn, NULL);
+ }
+ else
+ {
+ if (cmd == START_ONLY)
+ {
+ if (!PFCsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, res, conn, false,
+ fsstate->query);
+
+ PFCsetAsyncScan(conn, fsstate);
+ goto end_of_fetch;
+ }
+
+ /* Elsewise do synchronous query execution */
+ PFCsetAsyncScan(conn, NULL);
+ res = PFCexec(conn, sql);
+ }
+
/* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ if (res && PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
- /* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
+ /* allocate tuple storage */
+ tmptuples = fsstate->tuples;
+ addrows = PQntuples(res);
+ restrows = fsstate->num_tuples - fsstate->next_tuple;
+ numrows = restrows + addrows;
+ fetch_buf_size = numrows * sizeof(HeapTuple);
+ fsstate->tuples = (HeapTuple *) palloc0(fetch_buf_size);
+
+ Assert(restrows == 0 || tmptuples);
+
+ /* copy unread tuples if any */
+ for (i = 0 ; i < restrows ; i++)
+ fsstate->tuples[i] = tmptuples[fsstate->next_tuple + i];
+
fsstate->num_tuples = numrows;
fsstate->next_tuple = 0;
- for (i = 0; i < numrows; i++)
+ /* Convert the data into HeapTuples */
+ for (i = 0 ; i < addrows; i++)
{
- fsstate->tuples[i] =
+ HeapTuple tup =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
fsstate->retrieved_attrs,
fsstate->temp_cxt);
+ fsstate->tuples[restrows + i] = tup;
+ fetch_buf_size += (HEAPTUPLESIZE + tup->t_len);
}
/* Update fetch_ct_2 */
@@ -2085,6 +2243,23 @@ fetch_more_data(ForeignScanState *node)
PQclear(res);
res = NULL;
+
+ if (cmd == ALLOW_ASYNC)
+ {
+ if (!fsstate->eof_reached)
+ {
+ /*
+ * We can immediately request the next bunch of tuples if
+ * we're on asynchronous connection.
+ */
+ if (!PFCsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ PFCsetAsyncScan(conn, fsstate);
+ }
+ }
+
+end_of_fetch:
+ ; /* Nothing to do here but needed to make compiler quiet. */
}
PG_CATCH();
{
@@ -2098,6 +2273,28 @@ fetch_more_data(ForeignScanState *node)
}
/*
+ * Force cancelling async command state.
+ */
+void
+finish_async_query(PgFdwConn *conn)
+{
+ PgFdwScanState *fsstate = PFCgetAsyncScan(conn);
+ PgFdwConn *async_conn;
+
+ /* Nothing to do if no async connection */
+ if (fsstate == NULL) return;
+ async_conn = fsstate->conn;
+ if (!async_conn ||
+ PFCgetNscans(async_conn) == 1 ||
+ !PFCisAsyncRunning(async_conn))
+ return;
+
+ fetch_more_data(PFCgetAsyncScan(async_conn), EXIT_ASYNC);
+
+ Assert(!PFCisAsyncRunning(async_conn));
+}
+
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -2151,7 +2348,7 @@ reset_transmission_modes(int nestlevel)
* Utility routine to close a cursor.
*/
static void
-close_cursor(PGconn *conn, unsigned int cursor_number)
+close_cursor(PgFdwConn *conn, unsigned int cursor_number)
{
char sql[64];
PGresult *res;
@@ -2162,7 +2359,7 @@ close_cursor(PGconn *conn, unsigned int cursor_number)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQexec(conn, sql);
+ res = PFCexec(conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, true, sql);
PQclear(res);
@@ -2184,6 +2381,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
GetPrepStmtNumber(fmstate->conn));
p_name = pstrdup(prep_name);
+ /* Finish async query if running */
+ finish_async_query(fmstate->conn);
+
/*
* We intentionally do not specify parameter types here, but leave the
* remote server to derive them by default. This avoids possible problems
@@ -2194,11 +2394,11 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = PQprepare(fmstate->conn,
- p_name,
- fmstate->query,
- 0,
- NULL);
+ res = PFCprepare(fmstate->conn,
+ p_name,
+ fmstate->query,
+ 0,
+ NULL);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
@@ -2316,7 +2516,7 @@ postgresAnalyzeForeignTable(Relation relation,
ForeignTable *table;
ForeignServer *server;
UserMapping *user;
- PGconn *conn;
+ PgFdwConn *conn;
StringInfoData sql;
PGresult *volatile res = NULL;
@@ -2348,7 +2548,7 @@ postgresAnalyzeForeignTable(Relation relation,
/* In what follows, do not risk leaking any PGresults. */
PG_TRY();
{
- res = PQexec(conn, sql.data);
+ res = PFCexec(conn, sql.data);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -2398,7 +2598,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
ForeignTable *table;
ForeignServer *server;
UserMapping *user;
- PGconn *conn;
+ PgFdwConn *conn;
unsigned int cursor_number;
StringInfoData sql;
PGresult *volatile res = NULL;
@@ -2442,7 +2642,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
/* In what follows, do not risk leaking any PGresults. */
PG_TRY();
{
- res = PQexec(conn, sql.data);
+ res = PFCexec(conn, sql.data);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
PQclear(res);
@@ -2472,7 +2672,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
snprintf(fetch_sql, sizeof(fetch_sql), "FETCH %d FROM c%u",
fetch_size, cursor_number);
- res = PQexec(conn, fetch_sql);
+ res = PFCexec(conn, fetch_sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -2600,7 +2800,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
bool import_not_null = true;
ForeignServer *server;
UserMapping *mapping;
- PGconn *conn;
+ PgFdwConn *conn;
StringInfoData buf;
PGresult *volatile res = NULL;
int numrows,
@@ -2633,7 +2833,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
conn = GetConnection(server, mapping, false);
/* Don't attempt to import collation if remote server hasn't got it */
- if (PQserverVersion(conn) < 90100)
+ if (PFCserverVersion(conn) < 90100)
import_collate = false;
/* Create workspace for strings */
@@ -2646,7 +2846,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, "SELECT 1 FROM pg_catalog.pg_namespace WHERE nspname = ");
deparseStringLiteral(&buf, stmt->remote_schema);
- res = PQexec(conn, buf.data);
+ res = PFCexec(conn, buf.data);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -2741,7 +2941,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, " ORDER BY c.relname, a.attnum");
/* Fetch the data */
- res = PQexec(conn, buf.data);
+ res = PFCexec(conn, buf.data);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 3835ddb..c87e5cf 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -18,19 +18,22 @@
#include "nodes/relation.h"
#include "utils/relcache.h"
-#include "libpq-fe.h"
+#include "PgFdwConn.h"
+
+struct PgFdwScanState;
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
+extern void finish_async_query(PgFdwConn *fsstate);
/* in connection.c */
-extern PGconn *GetConnection(ForeignServer *server, UserMapping *user,
+extern PgFdwConn *GetConnection(ForeignServer *server, UserMapping *user,
bool will_prep_stmt);
-extern void ReleaseConnection(PGconn *conn);
-extern unsigned int GetCursorNumber(PGconn *conn);
-extern unsigned int GetPrepStmtNumber(PGconn *conn);
-extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
+extern void ReleaseConnection(PgFdwConn *conn);
+extern unsigned int GetCursorNumber(PgFdwConn *conn);
+extern unsigned int GetPrepStmtNumber(PgFdwConn *conn);
+extern void pgfdw_report_error(int elevel, PGresult *res, PgFdwConn *conn,
bool clear, const char *sql);
/* in option.c */
--
1.8.3.1
0006-Debug-message-for-async-execution-of-postgres_fdw.patchtext/x-patch; charset=us-asciiDownload
>From 249920b4d812127535731c3d6ced714506879241 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 10 Jul 2015 15:02:59 +0900
Subject: [PATCH 6/6] Debug message for async execution of postgres_fdw.
---
contrib/postgres_fdw/postgres_fdw.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index dc60bcc..5d62e36 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -385,6 +385,25 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
PG_RETURN_POINTER(routine);
}
+static void
+postgresDebugLog(PgFdwScanState *fsstate, char *msg, void* ptr)
+{
+ ForeignTable *table = GetForeignTable(RelationGetRelid(fsstate->rel));
+ ForeignServer *server = GetForeignServer(table->serverid);
+
+ if (fsstate->conn)
+ ereport(LOG,
+ (errmsg ("pg_fdw: [%s/%s/%p] %s",
+ get_rel_name(table->relid), server->servername,
+ fsstate->conn, msg),
+ errhidestmt(true)));
+ else
+ ereport(LOG,
+ (errmsg ("pg_fdw: [%s/%s/--] %s",
+ get_rel_name(table->relid), server->servername, msg),
+ errhidestmt(true)));
+}
+
/*
* Read boolean server/table options
* 0 is false, 1 is true, -1 is not specified
@@ -928,7 +947,10 @@ postgresStartForeignScan(ForeignScanState *node)
PgFdwScanState *fsstate = (PgFdwScanState *)node->fdw_state;
if (!fsstate->allow_async)
+ {
+ postgresDebugLog(fsstate, "Async start administratively denied.", NULL);
return false;
+ }
/*
* In the current implementation, a scan can run asynchronously if it is the
@@ -943,9 +965,11 @@ postgresStartForeignScan(ForeignScanState *node)
* for details
*/
fetch_more_data(fsstate, START_ONLY);
+ postgresDebugLog(fsstate, "Async exec started.", fsstate->conn);
return true;
}
+ postgresDebugLog(fsstate, "Async exec denied.", NULL);
return false;
}
@@ -2162,6 +2186,9 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
*/
if (fsstate != PFCgetAsyncScan(conn))
{
+ postgresDebugLog(fsstate,
+ "Changed to sync fetch (different scan)",
+ fsstate->conn);
fetch_more_data(PFCgetAsyncScan(conn), EXIT_ASYNC);
res = PFCexec(conn, sql);
}
@@ -2196,6 +2223,7 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
}
/* Elsewise do synchronous query execution */
+ postgresDebugLog(fsstate, "Sync fetch.", conn);
PFCsetAsyncScan(conn, NULL);
res = PFCexec(conn, sql);
}
--
1.8.3.1
Hi,
Currently there is no way to observe from the outside what the
asynchronous machinery is doing, so the additional sixth patch
outputs debug messages about asynchronous execution.
The sixth patch did not contain one of the messages shown in the example.
Attached is the revised version.
Other patches are not changed.
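For illustration, the messages show up in the server log roughly like the
lines below; the table name, server name and connection pointer here are
made-up values, only the message format comes from the patch:

LOG:  pg_fdw: [ft1/loopback/0x1f2a3b0] Async exec started.
LOG:  pg_fdw: [ft2/loopback/0x1f2a3b0] Async exec denied.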
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0006-Debug-message-for-async-execution-of-postgres_fdw.patchtext/x-patch; charset=us-asciiDownload
>From d1ed9fe6a4e68d42653a552a680a038a0aef5683 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 10 Jul 2015 15:02:59 +0900
Subject: [PATCH 6/6] Debug message for async execution of postgres_fdw.
---
contrib/postgres_fdw/postgres_fdw.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index dc60bcc..a8a9cc5 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -385,6 +385,25 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
PG_RETURN_POINTER(routine);
}
+static void
+postgresDebugLog(PgFdwScanState *fsstate, char *msg, void* ptr)
+{
+ ForeignTable *table = GetForeignTable(RelationGetRelid(fsstate->rel));
+ ForeignServer *server = GetForeignServer(table->serverid);
+
+ if (fsstate->conn)
+ ereport(LOG,
+ (errmsg ("pg_fdw: [%s/%s/%p] %s",
+ get_rel_name(table->relid), server->servername,
+ fsstate->conn, msg),
+ errhidestmt(true)));
+ else
+ ereport(LOG,
+ (errmsg ("pg_fdw: [%s/%s/--] %s",
+ get_rel_name(table->relid), server->servername, msg),
+ errhidestmt(true)));
+}
+
/*
* Read boolean server/table options
* 0 is false, 1 is true, -1 is not specified
@@ -928,7 +947,10 @@ postgresStartForeignScan(ForeignScanState *node)
PgFdwScanState *fsstate = (PgFdwScanState *)node->fdw_state;
if (!fsstate->allow_async)
+ {
+ postgresDebugLog(fsstate, "Async start administratively denied.", NULL);
return false;
+ }
/*
* In the current implementation, a scan can run asynchronously if it is the
@@ -943,9 +965,11 @@ postgresStartForeignScan(ForeignScanState *node)
* for details
*/
fetch_more_data(fsstate, START_ONLY);
+ postgresDebugLog(fsstate, "Async exec started.", fsstate->conn);
return true;
}
+ postgresDebugLog(fsstate, "Async exec denied.", NULL);
return false;
}
@@ -2162,11 +2186,16 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
*/
if (fsstate != PFCgetAsyncScan(conn))
{
+ postgresDebugLog(fsstate,
+ "Changed to sync fetch (different scan)",
+ fsstate->conn);
fetch_more_data(PFCgetAsyncScan(conn), EXIT_ASYNC);
res = PFCexec(conn, sql);
}
else
{
+ postgresDebugLog(fsstate,
+ "Async fetch", fsstate->conn);
/* Get result of running async fetch */
res = PFCgetResult(conn);
if (PQntuples(res) == fetch_size)
@@ -2196,6 +2225,7 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
}
/* Elsewise do synchronous query execution */
+ postgresDebugLog(fsstate, "Sync fetch.", conn);
PFCsetAsyncScan(conn, NULL);
res = PFCexec(conn, sql);
}
--
1.8.3.1
On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
At a quick glance, I think this has all the same problems as starting the
execution at ExecInit phase. The correct way to do this is to kick off the
queries in the first IterateForeignScan() call. You said that "ExecProc
phase does not fit" - why not?
What exactly are those problems?
I can think of these:
1. If the scan is parametrized, we probably can't do it for lack of
knowledge of what they will be. This seems easy; just don't do it in
that case.
2. It's possible that we're down inside some subtree of the plan that
won't actually get executed. This is trickier.
Consider this:
Append
-> Foreign Scan
-> Foreign Scan
-> Foreign Scan
<repeat 17 more times>
If we don't start each foreign scan until the first tuple is fetched,
we will not get any benefit here, because we won't fetch the first
tuple from query #2 until we finish reading the results of query #1.
If the result of the Append node will be needed in its entirety, we
really, really want to launch all of those queries as early as possible.
OTOH, if there's a Limit node with a small limit on top of the Append
node, that could be quite wasteful. We could decide not to care:
after all, if our limit is satisfied, we can just bang the remote
connections shut, and if they wasted some CPU, well, tough luck for
them. But it would be nice to be smarter. I'm not sure how, though.
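For point 1 above, in postgres_fdw terms the guard could be as small as
the following in the patch's postgresStartForeignScan() - a hypothetical
sketch, not something the posted patch currently does:

	/* Hypothetical guard: a parametrized scan cannot be started early,
	 * because the parameter values are not known yet. */
	if (fsstate->numParams > 0)
		return false;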
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello, thank you for the comment.
At Fri, 17 Jul 2015 14:34:53 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoaiJK1svzw_GkFU+zsSxciJKFELqu2AOMVUPhpSFw4BsQ@mail.gmail.com>
On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
At a quick glance, I think this has all the same problems as starting the
execution at ExecInit phase. The correct way to do this is to kick off the
queries in the first IterateForeignScan() call. You said that "ExecProc
phase does not fit" - why not?
What exactly are those problems?
I can think of these:
1. If the scan is parametrized, we probably can't do it for lack of
knowledge of what they will be. This seems easy; just don't do it in
that case.
We can give foreign scans an early kick only for the first shot
if we do it outside of (before) the ExecProc phase.
Nestloop
-> SeqScan
-> Append
-> Foreign (Index) Scan
-> Foreign (Index) Scan
..
This plan presupposes a precise (or at least reasonably precise)
estimate for the remote queries, but async execution within the
ExecProc phase would still be effective in this case.
2. It's possible that we're down inside some subtree of the plan that
won't actually get executed. This is trickier.
As for the current postgres_fdw, that is handled by simply abandoning
the queued result and then closing the cursor.
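In terms of the helpers in the current patch, that amounts to roughly
the following (abandon_async_scan is just an illustrative name here,
not a function in the patch):

static void
abandon_async_scan(PgFdwScanState *fsstate)
{
	/* Drain the result of the speculative FETCH, if one is in flight. */
	finish_async_query(fsstate->conn);

	/* Then just close the remote cursor; the queued rows are thrown away. */
	if (fsstate->cursor_exists)
		close_cursor(fsstate->conn, fsstate->cursor_number);
	fsstate->cursor_exists = false;
}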
Consider this:
Append
-> Foreign Scan
-> Foreign Scan
-> Foreign Scan
<repeat 17 more times>
If we don't start each foreign scan until the first tuple is fetched,
we will not get any benefit here, because we won't fetch the first
tuple from query #2 until we finish reading the results of query #1.
If the result of the Append node will be needed in its entirety, we
really, really want to launch of those queries as early as possible.
OTOH, if there's a Limit node with a small limit on top of the Append
node, that could be quite wasteful.
It's the nature of speculative execution, but the Limit will be
pushed down onto every Foreign Scan in the near future.
We could decide not to care: after all, if our limit is
satisfied, we can just bang the remote connections shut, and if
they wasted some CPU, well, tough luck for them. But it would
be nice to be smarter. I'm not sure how, though.
An appropriate fetch size will cap the harm, and for postgres_fdw the
case will be handled as I mentioned above.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Horiguchi-san,
Let me ask an elementary question.
If we have ParallelAppend node that kicks a background worker process for
each underlying child node in parallel, does ForeignScan need to do something
special?
Expected waste of CPU or I/O is common problem to be solved, however, it does
not need to add a special case handling to ForeignScan, I think.
How about your opinion?
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
Hello,
Let me ask an elementary question.
If we have ParallelAppend node that kicks a background worker process for
each underlying child node in parallel, does ForeignScan need to do something
special?
Although I don't see the point of the background worker in your
story, at least for ParallelMergeAppend the scan would frequently be
cut short by an upper Limit, so one more state, say setup - meaning a
worker is allocated but not yet started - would be useful, and the
driver node might need to manage the number of asynchronous
executions. Or, conversely, the driven nodes might do so themselves.
As for ForeignScan, it is merely an API for FDW and does nothing
substantial so it would have nothing special to do. As for
postgres_fdw, current patch restricts one execution per one
foreign server at once by itself. We would have to provide
another execution management if we want to have two or more
simultaneous scans per one foreign server at once.
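Just to recap the mechanism, the bookkeeping behind that restriction is a
per-connection record of which scan owns the in-flight FETCH, conceptually
something like the following (a simplified sketch, not the exact definition
in PgFdwConn.h; the members correspond to the PFC* accessors used in the
patch):

typedef struct PgFdwConn
{
	PGconn	   *pgconn;			/* underlying libpq connection */
	int			nscans;			/* number of scans sharing this connection */
	bool		async_running;	/* an asynchronous FETCH has been sent */
	struct PgFdwScanState *async_scan;	/* the scan that owns it, or NULL */
} PgFdwConn;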
Sorry for the focusless discussion but does this answer some of
your question?
Expected waste of CPU or I/O is common problem to be solved, however, it does
not need to add a special case handling to ForeignScan, I think.
How about your opinion?
I agree with you that ForeignScan as the wrapper for FDWs doesn't
need anything special for the case. I suppose for now that
avoiding the penalty from abandoning too many speculatively
executed scans (or other works on bg worker like sorts) would be
a business of the upper node of FDWs, or somewhere else.
However, I haven't dismissed the possibility that some common
works related to resource management could be integrated into
executor (or even into planner), but I see none for now.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
If we have ParallelAppend node that kicks a background worker process for
each underlying child node in parallel, does ForeignScan need to do something
special?
Although I don't see the point of the background worker in your
story, at least for ParallelMergeAppend the scan would frequently be
cut short by an upper Limit, so one more state, say setup - meaning a
worker is allocated but not yet started - would be useful, and the
driver node might need to manage the number of asynchronous
executions. Or, conversely, the driven nodes might do so themselves.
I expected workloads like single shot scan on a partitioned large
fact table on DWH system. Yep, if workload is expected to rescan
so frequently, its expected cost shall be higher (by the cost to
launch bgworker) than existing Append, then planner will kick out
this path.
Regarding the interaction between Limit and ParallelMergeAppend,
that is probably the best-case scenario, isn't it? If Limit picks up
the smallest 1000 rows from a partitioned table consisting of 20 child
tables, ParallelMergeAppend can launch 20 parallel jobs, each of which
picks up the smallest 1000 rows from its child relation.
Probably, it is the same job as done in pass_down_bound() of nodeLimit.c.
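For reference, what pass_down_bound() does is roughly the following (a
simplified sketch, not the verbatim nodeLimit.c source); a parallel-aware
MergeAppend could forward the bound to every child in the same way:

static void
pass_down_bound_sketch(LimitState *node, PlanState *child_node)
{
	if (IsA(child_node, SortState))
	{
		SortState  *sortState = (SortState *) child_node;

		/* The sort only needs to return the first count + offset tuples. */
		sortState->bounded = true;
		sortState->bound = node->count + node->offset;
	}
	else if (IsA(child_node, MergeAppendState))
	{
		MergeAppendState *maState = (MergeAppendState *) child_node;
		int			i;

		/* Each child likewise only needs its first count + offset tuples. */
		for (i = 0; i < maState->ms_nplans; i++)
			pass_down_bound_sketch(node, maState->mergeplans[i]);
	}
}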
As for ForeignScan, it is merely an API for FDW and does nothing
substantial so it would have nothing special to do. As for
postgres_fdw, current patch restricts one execution per one
foreign server at once by itself. We would have to provide
another execution management if we want to have two or more
simultaneous scans per one foreign server at once.
Yep, your 4th patch defines a new callback to FdwRoutines and
5th patch implements postgres_fdw specific portion.
It shall work well for a distributed / sharded database environment,
however, its benefit is around ForeignScan only.
Once management node kicks underlying SeqScan, ForeignScan or
others in parallel, it also enables to run local heap scan
asynchronously.
Sorry for the focusless discussion but does this answer some of
your question?
Hmm... Its advantage is still unclear for me. However, it is not
fair to hijack this thread by my idea.
I'll submit my design proposal about ParallelAppend towards the
next commit-fest. Please comment on.
Expected waste of CPU or I/O is common problem to be solved, however, it does
not need to add a special case handling to ForeignScan, I think.
How about your opinion?
I agree with you that ForeignScan as the wrapper for FDWs doesn't
need anything special for the case. I suppose for now that
avoiding the penalty from abandoning too many speculatively
executed scans (or other works on bg worker like sorts) would be
a business of the upper node of FDWs, or somewhere else.
However, I haven't dismissed the possibility that some common
works related to resource management could be integrated into
executor (or even into planner), but I see none for now.
I also agree it is "eventually" needed, but it may not be supported
in the first version.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
Hello,
At Thu, 23 Jul 2015 09:38:39 +0000, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote in <9A28C8860F777E439AA12E8AEA7694F80111BCEC@BPXM15GP.gisp.nec.co.jp>
I expected workloads like single shot scan on a partitioned large
fact table on DWH system. Yep, if workload is expected to rescan
so frequently, its expected cost shall be higher (by the cost to
launch bgworker) than existing Append, then planner will kick out
this path.
Regarding the interaction between Limit and ParallelMergeAppend,
that is probably the best-case scenario, isn't it? If Limit picks up
the smallest 1000 rows from a partitioned table consisting of 20 child
tables, ParallelMergeAppend can launch 20 parallel jobs, each of which
picks up the smallest 1000 rows from its child relation.
Probably, it is the same job as done in pass_down_bound() of nodeLimit.c.
Yes, I was a bit confused. That scenario is one of the least
problematic cases.
As for ForeignScan, it is merely an API for FDW and does nothing
substantial so it would have nothing special to do. As for
postgres_fdw, current patch restricts one execution per one
foreign server at once by itself. We would have to provide
another execution management if we want to have two or more
simultaneous scans per one foreign server at once.
Yep, your 4th patch defines a new callback to FdwRoutines and
5th patch implements postgres_fdw specific portion.
It shall work well for a distributed / sharded database environment,
however, its benefit is around ForeignScan only.
Once management node kicks underlying SeqScan, ForeignScan or
others in parallel, it also enables to run local heap scan
asynchronously.
I suppose SeqScan doesn't need an async kick since its startup cost is
extremely low, practically nothing. (Would fetching the first several
pages boost seqscans?) On the other hand, sort/hash would be an area
where asynchronous execution is effective.
Sorry for the focusless discussion but does this answer some of
your question?
Hmm... Its advantage is still unclear for me. However, it is not
fair to hijack this thread by my idea.
It would be more advantageous if join/sort pushdown on fdw comes,
where start-up cost could be extremely high...
I'll submit my design proposal about ParallelAppend towards the
next commit-fest. Please comment on.
Ok, I'll come there.
Expected waste of CPU or I/O is common problem to be solved, however, it does
not need to add a special case handling to ForeignScan, I think.
How about your opinion?
I agree with you that ForeignScan as the wrapper for FDWs doesn't
need anything special for the case. I suppose for now that
avoiding the penalty from abandoning too many speculatively
executed scans (or other works on bg worker like sorts) would be
a business of the upper node of FDWs, or somewhere else.However, I haven't dismissed the possibility that some common
works related to resource management could be integrated into
executor (or even into planner), but I see none for now.
I also agree it is "eventually" needed, but it may not be supported
in the first version.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello Horiguchi-san,
As for ForeignScan, it is merely an API for FDW and does nothing
substantial so it would have nothing special to do. As for
postgres_fdw, current patch restricts one execution per one
foreign server at once by itself. We would have to provide
another execution management if we want to have two or more
simultaneous scans per one foreign server at once.
Yep, your 4th patch defines a new callback to FdwRoutines and
5th patch implements postgres_fdw specific portion.
It shall work well for a distributed / sharded database environment,
however, its benefit is around ForeignScan only.
Once management node kicks underlying SeqScan, ForeignScan or
others in parallel, it also enables to run local heap scan
asynchronously.
I suppose SeqScan doesn't need an async kick since its startup cost is
extremely low, practically nothing. (Would fetching the first several
pages boost seqscans?) On the other hand, sort/hash would be an area
where asynchronous execution is effective.
Startup cost is not the only advantage of asynchronous execution.
If a background worker prefetches the records that will be read soon,
while other tasks are in progress, the latency to fetch the next record
is much lower than on the usual execution path.
Assume the next record is in neither shared buffers nor the operating
system's page cache. First, the upper node calls heap_getnext() to fetch
the next record; it looks up the target block in shared buffers, issues
a read(2) system call, and the operating system puts the caller process
to sleep until the block has been read from storage.
If an asynchronous worker has already gone through this painful code
path and the records to be read are waiting at the head of a queue, the
I/O wait time is reduced dramatically.
Sorry for the focusless discussion but does this answer some of
your question?
Hmm... Its advantage is still unclear for me. However, it is not
fair to hijack this thread by my idea.
It would be more advantageous if join/sort pushdown on fdw comes,
where start-up cost could be extremely high...
Not only FDW. I intend to combine ParallelAppend with another idea
I previously posted, to run table joins in parallel.
In the case of partitioned foreign tables, the planner probably needs to
consider (1) FDW scan + local serial join, (2) FDW scan + local parallel
join, or (3) FDW remote join, according to the cost.
* [idea] table partition + hash join:
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F672B@BPXM15GP.gisp.nec.co.jp
Anyway, let's have a further discussion in another thread.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
I've marked this as rejected in the commitfest, because others are
working on a more general solution with parallel workers. That's still
work-in-progress, and it's not certain if it's going to make it into
9.6, but if it does it will largely render this obsolete. We can revisit
this patch later in the release cycle, if the parallel scan patch hasn't
solved the same use case by then.
- Heikki
On Mon, Aug 10, 2015 at 3:23 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I've marked this as rejected in the commitfest, because others are
working on a more general solution with parallel workers. That's still
work-in-progress, and it's not certain if it's going to make it into
9.6, but if it does it will largely render this obsolete. We can revisit
this patch later in the release cycle, if the parallel scan patch hasn't
solved the same use case by then.
I think the really important issue for this patch is the one discussed here:
/messages/by-id/CA+TgmoaiJK1svzw_GkFU+zsSxciJKFELqu2AOMVUPhpSFw4BsQ@mail.gmail.com
You raised an important issue there but never really expressed an
opinion on the points I raised, here or on the other thread. And
neither did anyone else except the patch author who, perhaps
unsurprisingly, thinks it's OK. I wish we could get more discussion
about that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote:
On Mon, Aug 10, 2015 at 3:23 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I've marked this as rejected in the commitfest, because others are
working on a more general solution with parallel workers. That's still
work-in-progress, and it's not certain if it's going to make it into
9.6, but if it does it will largely render this obsolete. We can revisit
this patch later in the release cycle, if the parallel scan patch hasn't
solved the same use case by then.
I think the really important issue for this patch is the one discussed here:
/messages/by-id/CA+TgmoaiJK1svzw_GkFU+zsSxciJKFELqu2AOMVUPhpSFw4BsQ@mail.gmail.com
I agree that it'd be great to figure out the answer to #2, but I'm also
of the opinion that we can either let the user tell us through the use
of the GUCs proposed in the patch or simply not worry about the
potential for time wastage associated with starting them all at once, as
you suggested there.
You raised an important issue there but never really expressed an
opinion on the points I raised, here or on the other thread. And
neither did anyone else except the patch author who, perhaps
unsurprisingly, thinks it's OK. I wish we could get more discussion
about that.
When I read the proposal, I had the same reaction that it didn't seem
like quite the right place and it further bothered me that it was
specific to FDWs.
Perhaps not surprisingly, as I authored it, but I'm still a fan of my
proposal #1 here:
/messages/by-id/20131104032604.GB2706@tamriel.snowman.net
More generally, I completely agree with the position (I believe yours,
but I might be misremembering) that we want to have this async
capability independently and in addition to parallel scan. I don't
believe one obviates the advantages of the other.
Thanks!
Stephen