Asynchronous Append on postgres_fdw nodes.
Hello, this is a follow-on of [1] and [2].
Currently the executor visits execution nodes one by one. Considering
sharding, Append over multiple postgres_fdw nodes could run them
simultaneously, which can largely shorten the response time of the
whole query. For example, an aggregation that can be pushed down to
the remote side would be accelerated by a factor of the number of
remote servers. Even short of such an extreme case, merely collecting
tuples from multiple servers can be accelerated by tens of percent [2].
I had suspended this work while waiting for an asynchronous or push-up
executor to arrive, but the mood seems to be inclining toward doing
this before that comes [3].
The patchset consists of three parts.
- v2-0001-Allow-wait-event-set-to-be-registered-to-resource.patch
  The async feature uses WaitEventSet, which needs to be released on
  error. This patch makes it possible to register a WaitEventSet with
  a resource owner to handle that case.
- v2-0002-infrastructure-for-asynchronous-execution.patch
  It provides an abstraction layer for asynchronous behavior
  (execAsync), then adds ExecAppendAsync, another version of
  ExecAppend that handles "async-capable" subnodes asynchronously. It
  also contains the planner part that makes the planner aware of
  "async-capable" and "async-aware" path nodes.
- v2-0003-async-postgres_fdw.patch
The "async-capable" postgres_fdw. It accelerates multiple
postgres_fdw nodes on a single connection case as well as
postgres_fdw nodes on dedicate connections.
regards.
[1]: /messages/by-id/2020012917585385831113@highgo.ca
[2]: /messages/by-id/20180515.202945.69332784.horiguchi.kyotaro@lab.ntt.co.jp
[3]: /messages/by-id/20191205181217.GA12895@momjian.us
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v2-0001-Allow-wait-event-set-to-be-registered-to-resource.patch
From 22099ed9a6107b92c8e2b95ff1d199832810629c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v2 1/3] Allow wait event set to be registered to resource
owner
WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resowner field to WaitEventSet and allows
the creator of a WaitEventSet to specify a resource owner.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 18 ++++-
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 67 +++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 +++
6 files changed, 96 insertions(+), 5 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7717bb2719..16aefb03ee 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 046ca5c6c7..9c10bd5fcf 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -56,6 +56,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -84,6 +85,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -393,7 +396,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -560,12 +563,15 @@ ResetLatch(Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -680,6 +686,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -725,6 +736,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 37b6a4eecd..fcc92138fe 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -70,7 +70,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
{
WaitEventSet *new_event_set;
- new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+ new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
MyLatch, NULL);
AddWaitEventToSet(new_event_set, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 3c39e48825..035e83f4f8 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
ResourceArray jitarr; /* JIT contexts */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
jit_release_context(context);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -701,6 +714,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
Assert(owner->jitarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -728,6 +742,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
ResourceArrayFree(&(owner->jitarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1346,3 +1361,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
elog(ERROR, "JIT context %p is not owned by resource owner %s",
DatumGetPointer(handle), owner->name);
}
+
+/*
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 46ae56cae3..b1b8375768 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
extern void SetLatch(Latch *latch);
extern void ResetLatch(Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
extern void ResourceOwnerForgetJIT(ResourceOwner owner,
Datum handle);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.18.2
v2-0002-infrastructure-for-asynchronous-execution.patch
From 8d2fd1f17f8e38e1106017fe6327fbeaec3bcd52 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v2 2/3] infrastructure for asynchronous execution
This patch adds an infrastructure for asynchronous execution. As a
PoC, it makes only Append capable of handling asynchronously
executable subnodes.
---
src/backend/commands/explain.c | 17 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execAsync.c | 152 +++++++++++
src/backend/executor/nodeAppend.c | 342 ++++++++++++++++++++----
src/backend/executor/nodeForeignscan.c | 21 ++
src/backend/nodes/bitmapset.c | 72 +++++
src/backend/nodes/copyfuncs.c | 3 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 3 +
src/backend/optimizer/plan/createplan.c | 66 ++++-
src/backend/postmaster/pgstat.c | 3 +
src/backend/postmaster/syslogger.c | 2 +-
src/backend/utils/adt/ruleutils.c | 8 +-
src/include/executor/execAsync.h | 22 ++
src/include/executor/executor.h | 1 +
src/include/executor/nodeForeignscan.h | 3 +
src/include/foreign/fdwapi.h | 11 +
src/include/nodes/bitmapset.h | 1 +
src/include/nodes/execnodes.h | 23 +-
src/include/nodes/plannodes.h | 9 +
src/include/pgstat.h | 3 +-
21 files changed, 703 insertions(+), 63 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..daccad8268 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -84,6 +84,7 @@ static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
ExplainState *es);
static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1343,6 +1344,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1916,6 +1919,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Hash:
show_hash_info(castNode(HashState, planstate), es);
break;
+
+ case T_Append:
+ show_append_info(castNode(AppendState, planstate), es);
+ break;
+
default:
break;
}
@@ -2247,6 +2255,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ancestors, es);
}
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+ Append *plan = (Append *) astate->ps.plan;
+
+ if (plan->nasyncplans > 0)
+ ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
/*
* Show the grouping keys for an Agg node.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..8a2d6e9961 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller didn't reuse existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit)
+{
+ switch (nodeTag(node))
+ {
+ case T_ForeignScanState:
+ return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+ wes, data, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(node));
+ }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+ int **p_refind;
+ int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+ /* arg is the address of the variable refind in ExecAsyncEventWait */
+ ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+ *mcbarg->p_refind = NULL;
+ *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
+ WaitEventSet *wes;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred = 0;
+ Bitmapset *fired_events = NULL;
+ int i;
+ int n;
+
+ n = bms_num_members(waitnodes);
+ wes = CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner, n);
+ if (refindsize < n)
+ {
+ if (refindsize == 0)
+ refindsize = EVENT_BUFFER_SIZE; /* XXX */
+ while (refindsize < n)
+ refindsize *= 2;
+ if (refind)
+ refind = (int *) repalloc(refind, refindsize * sizeof(int));
+ else
+ {
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+ MemoryContext oldctxt =
+ MemoryContextSwitchTo(TopTransactionContext);
+
+ /*
+ * refind points to a memory block in
+ * TopTransactionContext. Register a callback to reset it.
+ */
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+ refind = (int *) palloc(refindsize * sizeof(int));
+ MemoryContextSwitchTo(oldctxt);
+ }
+ }
+
+ /* Prepare WaitEventSet for waiting on the waitnodes. */
+ n = 0;
+ for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+ i = bms_next_member(waitnodes, i))
+ {
+ refind[i] = i;
+ if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+ n++;
+ }
+
+ /* Return immediately if no node to wait. */
+ if (n == 0)
+ {
+ FreeWaitEventSet(wes);
+ return NULL;
+ }
+
+ noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+ EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+ FreeWaitEventSet(wes);
+ if (noccurred == 0)
+ return NULL;
+
+ for (i = 0 ; i < noccurred ; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+ {
+ int n = *(int*)w->user_data;
+
+ fired_events = bms_add_member(fired_events, n);
+ }
+ }
+
+ return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..b5a8adfaf8 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
#include "miscadmin.h"
/* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
#define INVALID_SUBPLAN_INDEX -1
static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
PlanState **appendplanstates;
Bitmapset *validsubplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
/* check for unsupported flags */
- Assert(!(eflags & EXEC_FLAG_MARK));
+ Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
/*
* create new AppendState for our append node
*/
appendstate->ps.plan = (Plan *) node;
appendstate->ps.state = estate;
- appendstate->ps.ExecProcNode = ExecAppend;
/* Let choose_next_subplan_* function handle setting the first subplan */
- appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/*
* When no run-time pruning is required and there's at least one
- * subplan, we can fill as_valid_subplans immediately, preventing
+ * subplan, we can fill as_valid_syncsubplans immediately, preventing
* later calls to ExecFindMatchingSubPlans.
*/
if (!prunestate->do_exec_prune && nplans > 0)
- appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
}
else
{
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* subplans as valid; they must also all be initialized.
*/
Assert(nplans > 0);
- appendstate->as_valid_subplans = validsubplans =
- bms_add_range(NULL, 0, nplans - 1);
+ validsubplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
appendstate->as_prune_state = NULL;
}
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
*/
j = 0;
firstvalid = nplans;
+ nasyncplans = 0;
+
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ int sub_eflags = eflags;
+
+ /* Let async-capable subplans run asynchronously */
+ if (i < node->nasyncplans)
+ {
+ sub_eflags |= EXEC_FLAG_ASYNC;
+ nasyncplans++;
+ }
/*
* Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
if (i >= node->first_partial_plan && j < firstvalid)
firstvalid = j;
- appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+ appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
}
appendstate->as_first_partial_plan = firstvalid;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* fill in async stuff */
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_syncdone = (nasyncplans == nplans);
+ appendstate->as_exec_prune = false;
+
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+ if (appendstate->as_nasyncplans)
+ {
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async requests need a request */
+ appendstate->as_needrequest =
+ bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+ /*
+ * ExecAppendAsync needs as_valid_syncsubplans to handle async
+ * subnodes.
+ */
+ if (appendstate->as_prune_state != NULL &&
+ appendstate->as_prune_state->do_exec_prune)
+ {
+ Assert(appendstate->as_valid_syncsubplans == NULL);
+
+ appendstate->as_exec_prune = true;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
- if (node->as_whichplan < 0)
+ if (node->as_whichsyncplan < 0)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
* If no subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+ if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
!node->choose_next_subplan(node))
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
+ Assert(node->as_nasyncplans == 0);
+
for (;;)
{
PlanState *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
/*
* figure out which subplan we are currently processing
*/
- Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
- subnode = node->appendplans[node->as_whichplan];
+ Assert(node->as_whichsyncplan >= 0 &&
+ node->as_whichsyncplan < node->as_nplans);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
}
}
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+ AppendState *node = castNode(AppendState, pstate);
+ Bitmapset *needrequest;
+ int i;
+
+ Assert(node->as_nasyncplans > 0);
+
+restart:
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (node->as_exec_prune)
+ {
+ Bitmapset *valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ /* Distribute valid subplans into sync and async */
+ node->as_needrequest =
+ bms_intersect(node->as_needrequest, valid_subplans);
+ node->as_valid_syncsubplans =
+ bms_difference(valid_subplans, node->as_needrequest);
+
+ node->as_exec_prune = false;
+ }
+
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ while ((i = bms_first_member(needrequest)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
+ {
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ }
+ }
+ else
+ node->as_pending_async = bms_add_member(node->as_pending_async, i);
+ }
+ bms_free(needrequest);
+
+ for (;;)
+ {
+ TupleTableSlot *result;
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ while (!bms_is_empty(node->as_pending_async))
+ {
+ /* Don't wait for async nodes if any sync node exists. */
+ long timeout = node->as_syncdone ? -1 : 0;
+ Bitmapset *fired;
+ int i;
+
+ fired = ExecAsyncEventWait(node->appendplans,
+ node->as_pending_async,
+ timeout);
+
+ if (bms_is_empty(fired) && node->as_syncdone)
+ {
+ /*
+ * We come here when all the subnodes had fired before
+ * waiting. Retry fetching from the nodes.
+ */
+ node->as_needrequest = node->as_pending_async;
+ node->as_pending_async = NULL;
+ goto restart;
+ }
+
+ while ((i = bms_first_member(fired)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+ slot = ExecProcNode(subnode);
+
+ Assert(subnode->asyncstate == AS_AVAILABLE);
+
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, i);
+ }
+
+ node->as_pending_async =
+ bms_del_member(node->as_pending_async, i);
+ }
+ bms_free(fired);
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done scanning
+ * this node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the synchronous children.
+ */
+
+ if (!node->as_syncdone &&
+ node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+ node->as_syncdone = !node->choose_next_subplan(node);
+
+ if (node->as_syncdone)
+ {
+ Assert(bms_is_empty(node->as_pending_async));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
+ /*
+ * get a tuple from the subplan
+ */
+ result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+ if (!TupIsNull(result))
+ {
+ /*
+ * If the subplan gave us something then return it as-is. We do
+ * NOT make use of the result slot that was set up in
+ * ExecInitAppend; there's no need for it.
+ */
+ return result;
+ }
+
+ /*
+ * Go on to the "next" subplan. If no more subplans, return the empty
+ * slot set up for us by ExecInitAppend, unless there are async plans
+ * we have yet to finish.
+ */
+ if (!node->choose_next_subplan(node))
+ {
+ node->as_syncdone = true;
+ if (bms_is_empty(node->as_pending_async))
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /* Else loop back and try to get a tuple from the new subplan */
+ }
+}
+
/* ----------------------------------------------------------------
* ExecEndAppend
*
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
bms_overlap(node->ps.chgParam,
node->as_prune_state->execparamids))
{
- bms_free(node->as_valid_subplans);
- node->as_valid_subplans = NULL;
+ bms_free(node->as_valid_syncsubplans);
+ node->as_valid_syncsubplans = NULL;
}
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ ExecShutdownNode(node->appendplans[i]);
+
+ node->as_nasyncresult = 0;
+ node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
}
/* Let choose_next_subplan_* function handle setting the first subplan */
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
}
/* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
static bool
choose_next_subplan_locally(AppendState *node)
{
- int whichplan = node->as_whichplan;
+ int whichplan = node->as_whichsyncplan;
int nextplan;
/* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
{
- if (node->as_valid_subplans == NULL)
- node->as_valid_subplans =
+ /* Shouldn't have an active async node */
+ Assert(bms_is_empty(node->as_needrequest));
+
+ if (node->as_valid_syncsubplans == NULL)
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
+ /* Exclude async plans */
+ if (node->as_nasyncplans > 0)
+ bms_del_range(node->as_valid_syncsubplans,
+ 0, node->as_nasyncplans - 1);
+
whichplan = -1;
}
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
Assert(whichplan >= -1 && whichplan <= node->as_nplans);
if (ScanDirectionIsForward(node->ps.state->es_direction))
- nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
else
- nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
if (nextplan < 0)
return false;
- node->as_whichplan = nextplan;
+ node->as_whichsyncplan = nextplan;
return true;
}
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
{
/* Mark just-completed subplan as finished. */
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
}
else
{
/* Start with last subplan. */
- node->as_whichplan = node->as_nplans - 1;
+ node->as_whichsyncplan = node->as_nplans - 1;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be
* set to all subplans.
*/
- if (node->as_valid_subplans == NULL)
+ if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
/*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
}
/* Loop until we find a subplan to execute. */
- while (pstate->pa_finished[node->as_whichplan])
+ while (pstate->pa_finished[node->as_whichsyncplan])
{
- if (node->as_whichplan == 0)
+ if (node->as_whichsyncplan == 0)
{
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
LWLockRelease(&pstate->pa_lock);
return false;
}
/*
- * We needn't pay attention to as_valid_subplans here as all invalid
+ * We needn't pay attention to as_valid_syncsubplans here as all invalid
* plans have been marked as finished.
*/
- node->as_whichplan--;
+ node->as_whichsyncplan--;
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
/* Mark just-completed subplan as finished. */
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be set
* to all subplans.
*/
- else if (node->as_valid_subplans == NULL)
+ else if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
mark_invalid_subplans_as_finished(node);
}
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Save the plan from which we are starting the search. */
- node->as_whichplan = pstate->pa_next_plan;
+ node->as_whichsyncplan = pstate->pa_next_plan;
/* Loop until we find a valid subplan to execute. */
while (pstate->pa_finished[pstate->pa_next_plan])
{
int nextplan;
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
if (nextplan >= 0)
{
/* Advance to the next valid plan. */
pstate->pa_next_plan = nextplan;
}
- else if (node->as_whichplan > node->as_first_partial_plan)
+ else if (node->as_whichsyncplan > node->as_first_partial_plan)
{
/*
* Try looping back to the first valid partial plan, if there is
* one. If there isn't, arrange to bail out below.
*/
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
pstate->pa_next_plan =
- nextplan < 0 ? node->as_whichplan : nextplan;
+ nextplan < 0 ? node->as_whichsyncplan : nextplan;
}
else
{
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
* At last plan, and either there are no partial plans or we've
* tried them all. Arrange to bail out.
*/
- pstate->pa_next_plan = node->as_whichplan;
+ pstate->pa_next_plan = node->as_whichsyncplan;
}
- if (pstate->pa_next_plan == node->as_whichplan)
+ if (pstate->pa_next_plan == node->as_whichsyncplan)
{
/* We've tried everything! */
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Pick the plan we found, and advance pa_next_plan one more time. */
- node->as_whichplan = pstate->pa_next_plan;
- pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+ node->as_whichsyncplan = pstate->pa_next_plan;
+ pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
/*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
*/
if (pstate->pa_next_plan < 0)
{
- int nextplan = bms_next_member(node->as_valid_subplans,
+ int nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
Assert(node->as_prune_state);
/* Nothing to do if all plans are valid */
- if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+ if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
return;
/* Mark all non-valid plans as finished */
for (i = 0; i < node->as_nplans; i++)
{
- if (!bms_is_member(i, node->as_valid_subplans))
+ if (!bms_is_member(i, node->as_valid_syncsubplans))
node->as_pstate->pa_finished[i] = true;
}
}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+ scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+ if ((eflags & EXEC_FLAG_ASYNC) != 0)
+ scanstate->fs_async = true;
/*
* Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecForeignAsyncConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+ caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
return a;
}
+/*
+ * bms_del_range
+ * Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+ int lwordnum,
+ lbitnum,
+ uwordnum,
+ ushiftbits,
+ wordnum;
+
+ if (lower < 0 || upper < 0)
+ elog(ERROR, "negative bitmapset member not allowed");
+ if (lower > upper)
+ elog(ERROR, "lower range must not be above upper range");
+ uwordnum = WORDNUM(upper);
+
+ if (a == NULL)
+ {
+ a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ }
+
+ /* ensure we have enough words to store the upper bit */
+ else if (uwordnum >= a->nwords)
+ {
+ int oldnwords = a->nwords;
+ int i;
+
+ a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ /* zero out the enlarged portion */
+ for (i = oldnwords; i < a->nwords; i++)
+ a->words[i] = 0;
+ }
+
+ wordnum = lwordnum = WORDNUM(lower);
+
+ lbitnum = BITNUM(lower);
+ ushiftbits = BITNUM(upper) + 1;
+
+ /*
+	 * Special case: when lwordnum is the same as uwordnum we must apply both
+	 * the upper and the lower mask to the same word.
+ */
+ if (lwordnum == uwordnum)
+ {
+ a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+ | (~(bitmapword) 0) << ushiftbits);
+ }
+ else
+ {
+ /* turn off lbitnum and all bits left of it */
+ a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+ /* turn off all bits for any intermediate words */
+ while (wordnum < uwordnum)
+ a->words[wordnum++] = (bitmapword) 0;
+
+ /* turn off upper's bit and all bits right of it. */
+ a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+ }
+
+ return a;
+}
+
/*
* bms_int_members - like bms_intersect, but left input is recycled
*/
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 54ad62bb7f..59205e5da6 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..130b4c7b85 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 551ce6c41c..1708337177 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1571,6 +1571,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1671,6 +1672,8 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..8bb5294155 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -292,6 +292,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
/*
@@ -1069,6 +1070,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
bool tlist_was_changed = false;
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
+ List *asyncpaths = NIL;
+ List *syncpaths = NIL;
+ List *newsubpaths = NIL;
ListCell *subpaths;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
@@ -1077,6 +1083,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1206,9 +1215,36 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
}
- subplans = lappend(subplans, subplan);
+ /*
+ * Classify as async-capable or not. If we have decided to run the
+	 * children in parallel, we cannot run any one of them asynchronously.
+ */
+ if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ asyncplans = lappend(asyncplans, subplan);
+ asyncpaths = lappend(asyncpaths, subpath);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ {
+ syncplans = lappend(syncplans, subplan);
+ syncpaths = lappend(syncpaths, subpath);
+ }
+
+ first = false;
}
+ /*
+	 * subplans now contains the async plans in the first half, if any, and the
+	 * sync plans in the second half, if any. Reorder subpaths the same way to
+	 * keep the partition pruning information in sync with subplans.
+ */
+ subplans = list_concat(asyncplans, syncplans);
+ newsubpaths = list_concat(asyncpaths, syncpaths);
+
/*
* If any quals exist, they may be useful to perform further partition
* pruning during execution. Gather information needed by the executor to
@@ -1236,7 +1272,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
if (prunequal != NIL)
partpruneinfo =
make_partition_pruneinfo(root, rel,
- best_path->subpaths,
+ newsubpaths,
best_path->partitioned_rels,
prunequal);
}
@@ -1244,6 +1280,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
plan->appendplans = subplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
+ plan->nasyncplans = nasyncplans;
+ plan->referent = referent_is_sync ? nasyncplans : 0;
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6841,3 +6879,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 462b4d7e06..4a812bed24 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3851,6 +3851,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index cf7b535e4e..32c1e51128 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -306,7 +306,7 @@ SysLoggerMain(int argc, char *argv[])
* syslog pipe, which implies that all other backends have exited
* (including the postmaster).
*/
- wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+ wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
#ifndef WIN32
AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 158784474d..70489c7c4c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4573,10 +4573,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
* tlists according to one of the children, and the first one is the most
* natural choice. Likewise special-case ModifyTable to pretend that the
* first child plan is the OUTER referent; this is to support RETURNING
- * lists containing references to non-target relations.
+ * lists containing references to non-target relations. For Append, use the
+ * explicitly specified referent.
*/
if (IsA(plan, Append))
- dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+ {
+ Append *app = (Append *) plan;
+ dpns->outer_plan = list_nth(app->appendplans, app->referent);
+ }
else if (IsA(plan, MergeAppend))
dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
else if (IsA(plan, ModifyTable))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+ long timeout);
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 81fdfa4add..e5d5e9726d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
#define EXEC_FLAG_MARK 0x0008 /* need mark/restore */
#define EXEC_FLAG_SKIP_TRIGGERS 0x0010 /* skip AfterTrigger calls */
#define EXEC_FLAG_WITH_NO_DATA 0x0020 /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC 0x0040 /* request async execution */
/* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data,
+ bool reinit);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
GetForeignPlan_function GetForeignPlan;
BeginForeignScan_function BeginForeignScan;
IterateForeignScan_function IterateForeignScan;
+ IterateForeignScan_function IterateForeignScanAsync;
ReScanForeignScan_function ReScanForeignScan;
EndForeignScan_function EndForeignScan;
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
InitializeDSMForeignScan_function InitializeDSMForeignScan;
ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
ShutdownForeignScan_function ShutdownForeignScan;
/* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
/* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f..7778f5ddc2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -936,6 +936,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
* abstract superclass for all PlanState-type nodes.
* ----------------
*/
+typedef enum AsyncState
+{
+ AS_AVAILABLE,
+ AS_WAITING
+} AsyncState;
+
typedef struct PlanState
{
NodeTag type;
@@ -1024,6 +1030,11 @@ typedef struct PlanState
bool outeropsset;
bool inneropsset;
bool resultopsset;
+
+	/* Async subnode execution stuff */
+ AsyncState asyncstate;
+
+ int32 padding; /* to keep alignment of derived types */
} PlanState;
/* ----------------
@@ -1219,14 +1230,21 @@ struct AppendState
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
- int as_whichplan;
+ int as_whichsyncplan; /* which sync plan is being executed */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
+ int as_nasyncplans; /* # of async-capable children */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
- Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_syncsubplans;
bool (*choose_next_subplan) (AppendState *);
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ Bitmapset *as_pending_async; /* pending async plans */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ bool as_exec_prune; /* runtime pruning needed for async exec? */
};
/* ----------------
@@ -1794,6 +1812,7 @@ typedef struct ForeignScanState
Size pscan_len; /* size of parallel coordination information */
/* use struct pointer to avoid including fdwapi.h here */
struct FdwRoutine *fdwroutine;
+ bool fs_async;
void *fdw_state; /* foreign-data wrapper can keep state here */
} ForeignScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 99835ae2e4..fa4ddbb400 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+	bool		async_capable;	/* engage asynchronous execution logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -262,6 +267,10 @@ typedef struct Append
/* Info for run-time subplan pruning; NULL if we're not doing that */
struct PartitionPruneInfo *part_prune_info;
+
+ /* Async child node execution stuff */
+ int nasyncplans; /* # async subplans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3a65a51696..1bc713254c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -853,7 +853,8 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.18.2
Attachment: v2-0003-async-postgres_fdw.patch (text/x-patch; charset=us-ascii)
From 6312e1a42c5a89642bb0ec1b7373e5ce4f8e0326 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v2 3/3] async postgres_fdw
---
contrib/postgres_fdw/connection.c | 28 +
.../postgres_fdw/expected/postgres_fdw.out | 222 ++++---
contrib/postgres_fdw/postgres_fdw.c | 607 ++++++++++++++++--
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +-
5 files changed, 703 insertions(+), 176 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index e45647f3ea..2184f7745a 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
bool invalidated; /* true if reconnect is pending */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
entry->conn, server->servername, user->umid, user->userid);
+ entry->storage = NULL;
}
/*
@@ -215,6 +217,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
return entry->conn;
}
+/*
+ * Returns the connection-specific storage for this user. Allocate it with
+ * initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ bool found;
+ ConnCacheEntry *entry;
+ ConnCacheKey key;
+
+	/* Find storage using the same key as GetConnection */
+ key = user->umid;
+ entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+ Assert(found);
+
+	/* Create one if there is none yet. */
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
/*
* Connect to remote server using specified server and user mapping properties.
*/
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 62c2697920..e11e0d40a7 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6973,7 +6973,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7001,7 +7001,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7029,7 +7029,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7057,7 +7057,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7127,35 +7127,41 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7165,35 +7171,41 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7223,11 +7235,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-> Hash Join
Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
@@ -7241,12 +7254,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
select tableoid::regclass, * from bar order by 1,2;
@@ -7276,16 +7290,17 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
Output: ROW(foo_1.f1), foo_1.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_2
- Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
+ -> Async Foreign Scan on public.foo2 foo_3
Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_2
+ Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -7303,17 +7318,18 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
Output: ROW(foo_1.f1), foo_1.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_2
- Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
+ -> Async Foreign Scan on public.foo2 foo_3
Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_2
+ Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
+(47 rows)
update bar set f2 = f2 + 100
from
@@ -7463,27 +7479,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2 bar_1
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2 bar_1
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2 bar_1
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2 bar_1
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
@@ -8558,11 +8580,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a % 25 =0 ORDER BY 1,2,3;
a | b | c
@@ -8597,20 +8620,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
wr | wr
@@ -8639,11 +8664,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
a | b
@@ -8696,21 +8722,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
-- test FOR UPDATE; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
- QUERY PLAN
---------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------
LockRows
-> Sort
Sort Key: t1.a
-> Hash Join
Hash Cond: (t2.b = t1.a)
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
-> Hash
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
a | b
@@ -8745,18 +8773,19 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+ Async subplans: 3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
-- Plan with partitionwise aggregates is enabled
SET enable_partitionwise_aggregate TO true;
@@ -8767,13 +8796,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ Async subplans: 3
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
a | sum | min | count
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..7b34afa119 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -35,6 +37,7 @@
#include "optimizer/restrictinfo.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "postgres_fdw.h"
#include "utils/builtins.h"
#include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -122,11 +128,28 @@ enum FdwDirectModifyPrivateIndex
FdwDirectModifyPrivateSetProcessed
};
+/*
+ * Connection common state.
+ */
+typedef struct PgFdwConnCommonState
+{
+ ForeignScanState *leader; /* leader node of this connection */
+ bool busy; /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnCommonState *commonstate; /* connection common state */
+} PgFdwState;
+
/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +160,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready;
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool inqueue; /* true if this node is in waiter queue */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* last waiting node in waiting queue.
+ * valid only on the leader node */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
/*
* Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
PG_RETURN_POINTER(routine);
}
@@ -1434,12 +1476,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+ fsstate->s.commonstate->leader = NULL;
+ fsstate->s.commonstate->busy = false;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->inqueue = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1487,40 +1539,249 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_values);
}
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue the node if it isn't already in the queue. Immediately start the
+ * node if the connection is not busy.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+ /* do nothing if the node is already in the queue or already eof'ed */
+ if (leader == node || fsstate->inqueue || fsstate->eof_reached)
+ return;
+
+ if (leader == NULL)
+ {
+ /* immediately send request if not busy */
+ request_more_data(node);
+ }
+ else
+ {
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+ PgFdwScanState *last_waiter_state
+ = GetPgFdwScanState(leader_state->last_waiter);
+
+ last_waiter_state->waiter = node;
+ leader_state->last_waiter = node;
+ fsstate->inqueue = true;
+ }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Make the first waiter the new leader.
+ * Returns the new leader, or NULL if there is no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *ret = fsstate->waiter;
+
+ Assert(fsstate->s.commonstate->leader == node);
+
+ if (ret)
+ {
+ PgFdwScanState *retstate = GetPgFdwScanState(ret);
+ fsstate->waiter = NULL;
+ retstate->last_waiter = fsstate->last_waiter;
+ retstate->inqueue = false;
+ }
+
+ fsstate->s.commonstate->leader = ret;
+
+ return ret;
+}
+
+/*
+ * Remove the node from the waiter queue.
+ *
+ * Pending results are cleared before removing the leader if it is busy.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state;
+ ForeignScanState *prev;
+ PgFdwScanState *prev_state;
+ ForeignScanState *cur;
+
+ /* no need to remove me */
+ if (!leader || !fsstate->inqueue)
+ return;
+
+ leader_state = GetPgFdwScanState(leader);
+
+ /* Remove the leader node */
+ if (leader == node)
+ {
+ ForeignScanState *next_leader;
+
+ if (leader_state->s.commonstate->busy)
+ {
+ /*
+ * This node is waiting for a result; absorb the result first so
+ * that subsequent commands can be sent on the connection.
+ */
+ PGconn *conn = leader_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+
+ leader_state->s.commonstate->busy = false;
+ }
+
+ /* Make the first waiter the leader */
+ if (leader_state->waiter)
+ {
+ PgFdwScanState *next_leader_state;
+
+ next_leader = leader_state->waiter;
+ next_leader_state = GetPgFdwScanState(next_leader);
+
+ leader_state->s.commonstate->leader = next_leader;
+ next_leader_state->last_waiter = leader_state->last_waiter;
+ }
+ leader_state->waiter = NULL;
+
+ return;
+ }
+
+ /*
+ * Just remove the node in queue
+ *
+ * This function is called on the shutdown path. We don't bother
+ * considering a faster way to do this.
+ */
+ prev = leader;
+ prev_state = leader_state;
+ cur = GetPgFdwScanState(prev)->waiter;
+ while (cur)
+ {
+ PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+ if (cur == node)
+ {
+ prev_state->waiter = curstate->waiter;
+ if (leader_state->last_waiter == cur)
+ leader_state->last_waiter = prev;
+
+ fsstate->inqueue = false;
+
+ return;
+ }
+ prev = cur;
+ prev_state = curstate;
+ cur = curstate->waiter;
+ }
+}
+
/*
* postgresIterateForeignScan
- * Retrieve next row from the result set, or clear tuple slot to indicate
- * EOF.
+ * Retrieve next row from the result set.
+ *
+ * For synchronous nodes, returning a cleared tuple slot means EOF.
+ *
+ * For asynchronous nodes, if a cleared tuple slot is returned, the caller
+ * needs to check asyncstate to tell whether all tuples have been received
+ * (AS_AVAILABLE) or more data is expected (AS_WAITING).
*/
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- /*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
+ if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+ {
+ /* we've run out, get some more tuples */
+ if (!node->fs_async)
+ {
+ /* finish the running query before sending command for this node */
+ if (!fsstate->s.commonstate->busy)
+ vacate_connection((PgFdwState *)fsstate, false);
+
+ request_more_data(node);
+
+ /* Fetch the result immediately. */
+ fetch_received_data(node);
+ }
+ else if (!fsstate->s.commonstate->busy)
+ {
+ /* If the connection is not busy, just send the request. */
+ request_more_data(node);
+ }
+ else
+ {
+ /* The connection is busy */
+ bool available = true;
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+ /* Check if the result is immediately available */
+ if (PQisBusy(leader_state->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT |
+ WL_EXIT_ON_PM_DEATH,
+ PQsocket(leader_state->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (!(rc & WL_SOCKET_READABLE))
+ available = false;
+ }
+
+ /* Fetch the leader's data if any */
+ if (available)
+ fetch_received_data(leader);
+
+ /* queue the requested node */
+ add_async_waiter(node);
+
+ /* queue the previous leader for the next request if needed */
+ add_async_waiter(leader);
+ }
+ }
- /*
- * Get some more tuples, if we've run out.
- */
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
- if (fsstate->next_tuple >= fsstate->num_tuples)
- return ExecClearTuple(slot);
+ /*
+ * We haven't received a result for the given node this time, return
+ * with no tuple to give way to another node.
+ */
+ if (fsstate->eof_reached)
+ {
+ fsstate->result_ready = true;
+ node->ss.ps.asyncstate = AS_AVAILABLE;
+ }
+ else
+ {
+ fsstate->result_ready = false;
+ node->ss.ps.asyncstate = AS_WAITING;
+ }
+
+ return ExecClearTuple(slot);
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
+ node->ss.ps.asyncstate = AS_AVAILABLE;
ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
false);
@@ -1535,7 +1796,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1543,6 +1804,8 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ vacate_connection((PgFdwState *)fsstate, true);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1571,9 +1834,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1591,7 +1854,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1599,15 +1862,31 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
+/*
+ * postgresShutdownForeignScan
+ * Remove asynchrony state and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* remove the node from waiting queue */
+ remove_async_node(node);
+}
+
/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
@@ -2372,7 +2651,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2457,7 +2738,11 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ /* finish running query to send my command */
+ vacate_connection((PgFdwState *)dmstate, true);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2504,8 +2789,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2703,6 +2988,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnCommonState *commonstate;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2747,6 +3033,18 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ commonstate = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnCommonState));
+ if (commonstate)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.commonstate = commonstate;
+
+ /* finish running query to send my command */
+ vacate_connection(&tmpstate, true);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3317,11 +3615,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -3384,50 +3682,127 @@ create_cursor(ForeignScanState *node)
}
/*
- * Fetch some more rows from the node's cursor.
+ * Send the next fetch request for the node. If the given node is different
+ * from the current connection leader, push the current leader back onto the
+ * waiter queue and make the given node the leader.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* must be non-busy */
+ Assert(!fsstate->s.commonstate->busy);
+ /* must be not-eof */
+ Assert(!fsstate->eof_reached);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.commonstate->busy = true;
+
+ /* Let the node be the leader if it is different from the current one */
+ if (leader != node)
+ {
+ /*
+ * If the connection leader exists, insert the node as the connection
+ * leader making the current leader be the first waiter.
+ */
+ if (leader != NULL)
+ {
+ remove_async_node(node);
+ fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+ fsstate->waiter = leader;
+ }
+ else
+ {
+ fsstate->last_waiter = node;
+ fsstate->waiter = NULL;
+ }
+
+ fsstate->s.commonstate->leader = node;
+ }
+}
+
+/*
+ * Fetch the received data, then automatically send a request for the next
+ * waiter.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ ForeignScanState *waiter;
+
+ /* I should be the current connection leader */
+ Assert(fsstate->s.commonstate->leader == node);
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3437,22 +3812,75 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
+
+ PQclear(res);
+ res = NULL;
}
PG_FINALLY();
{
+ fsstate->s.commonstate->busy = false;
+
if (res)
PQclear(res);
}
PG_END_TRY();
+
+ /* let the first waiter be the next leader of this connection */
+ waiter = move_to_next_waiter(node);
+
+ /* send the next request if any */
+ if (waiter)
+ request_more_data(waiter);
+
MemoryContextSwitchTo(oldcontext);
}
+/*
+ * Vacate the connection so that this node can send its next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+ PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+ ForeignScanState *leader;
+
+ /* the connection is already available */
+ if (commonstate == NULL || commonstate->leader == NULL || !commonstate->busy)
+ return;
+
+ /*
+ * let the current connection leader read the result for the running query
+ */
+ leader = commonstate->leader;
+ fetch_received_data(leader);
+
+ /* let the first waiter be the next leader of this connection */
+ move_to_next_waiter(leader);
+
+ if (!clear_queue)
+ return;
+
+ /* Clear the waiting list */
+ while (leader)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+ fsstate->last_waiter = NULL;
+ leader = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
@@ -3566,7 +3994,9 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -3653,6 +4083,9 @@ execute_foreign_modify(EState *estate,
operation == CMD_UPDATE ||
operation == CMD_DELETE);
+ /* finish running query to send my command */
+ vacate_connection((PgFdwState *)fmstate, true);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -3680,14 +4113,14 @@ execute_foreign_modify(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3695,10 +4128,10 @@ execute_foreign_modify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -3734,7 +4167,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3744,12 +4177,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3757,9 +4190,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3888,16 +4321,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -4056,9 +4489,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -4066,10 +4499,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -5560,6 +5993,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add wait event so that the ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* Reinit is not supported for now. */
+ Assert(reinit);
+
+ if (fsstate->s.commonstate->leader == node)
+ {
+ AddWaitEventToSet(wes,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, caller_data);
+ return true;
+ }
+
+ return false;
+}
+
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation, for use while EXPLAINing ForeignScan. It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..359208a12a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1780,25 +1780,25 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1840,12 +1840,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1904,8 +1904,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
-- Test that UPDATE/DELETE with inherited target works with row-level triggers
CREATE TRIGGER trig_row_before
--
2.18.2
On 2/28/20 3:06 AM, Kyotaro Horiguchi wrote:
Hello, this is a follow-on of [1] and [2].
Currently the executor visits execution nodes one-by-one. Considering
sharding, Append on multiple postgres_fdw nodes can work
simultaneously and that can largely shorten the respons of the whole
query. For example, aggregations that can be pushed-down to remote
would be accelerated by the number of remote servers. Even other than
such an extreme case, collecting tuples from multiple servers also can
be accelerated by tens of percent [2].
I have suspended the work waiting asyncrohous or push-up executor to
come but the mood seems inclining toward doing that before that to
come [3].
The patchset consists of three parts.
Are these improvements targeted at PG13 or PG14? This seems to be a
pretty big change for the last CF of PG13.
Regards,
--
-David
david@pgmasters.net
At Wed, 4 Mar 2020 09:56:55 -0500, David Steele <david@pgmasters.net> wrote in
On 2/28/20 3:06 AM, Kyotaro Horiguchi wrote:
Hello, this is a follow-on of [1] and [2].
Currently the executor visits execution nodes one-by-one. Considering
sharding, Append on multiple postgres_fdw nodes can work
simultaneously and that can largely shorten the respons of the whole
query. For example, aggregations that can be pushed-down to remote
would be accelerated by the number of remote servers. Even other than
such an extreme case, collecting tuples from multiple servers also can
be accelerated by tens of percent [2].
I have suspended the work waiting asyncrohous or push-up executor to
come but the mood seems inclining toward doing that before that to
come [3].
The patchset consists of three parts.
Are these improvements targeted at PG13 or PG14? This seems to be a
pretty big change for the last CF of PG13.
It is targeted at PG14. As we have the target version in CF-app now,
I marked it as targeting PG14.
Thank you for the suggestion.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
- v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
The async feature uses WaitEvent, and it needs to be released on
error. This patch makes it possible to register WaitEvent to
resowner to handle that case..
+1
- v2-0002-infrastructure-for-asynchronous-execution.patch
It povides an abstraction layer of asynchronous behavior
(execAsync). Then adds ExecAppend, another version of ExecAppend,
that handles "async-capable" subnodes asynchronously. Also it
contains planner part that makes planner aware of "async-capable"
and "async-aware" path nodes.
This patch add an infrastructure for asynchronous execution. As a PoC
this makes only Append capable to handle asynchronously executable
subnodes.
What other nodes do you think could be async aware? I suppose you
would teach joins to pass through the async support of their children,
and then you could make partition-wise join work like that.
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
Cool. No extra cost for people not using the new feature.
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
So, now when you execute a node, you get a result AND you get some
information that you access by reaching into the child node's
PlanState. The ExecProcNode() interface is extremely limiting, but
I'm not sure if this is the right way to extend it. Maybe
ExecAsyncProcNode() with a wide enough interface to do the job?
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
...
+ if (refindsize < n)
...
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
...
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
This seems a bit strange. Why not just put the pointer in the plan
state? I suppose you want to avoid allocating a new buffer for every
query. Perhaps you could fix that by having a small fixed-size buffer
in the PlanState to cover common cases and allocating a larger one in
a per-query memory context if that one is too small, or just not
worrying about it and allocating every time since you know the desired
size.
+ wes = CreateWaitEventSet(TopTransactionContext,
TopTransactionResourceOwner, n);
...
+ FreeWaitEventSet(wes);
BTW, just as an FYI, I am proposing [1] (https://commitfest.postgresql.org/27/2452/) to add support for
RemoveWaitEvent(), so that you could have a single WaitEventSet for
the lifetime of the executor node, and then add and remove sockets
only as needed. I'm hoping to commit that for PG13, if there are no
objections or better ideas soon, because it's useful for some other
places where we currently create and destroy WaitEventSets frequently.
One complication when working with long-lived WaitEventSet objects is
that libpq (or some other thing used by some other hypothetical
async-capable FDW) is free to close and reopen its socket whenever it
wants, so you need a way to know when it has done that. In that patch
set I added pqSocketChangeCount() so that you can see when PQsocket()
refers to a new socket (even if the file descriptor number is the same
by coincidence), but that imposes some book-keeping duties on the
caller, and now I'm wondering how that would look in your patch set.
My goal is to generate the minimum number of system calls. I think
it would be nice if a 1000-shard query only calls epoll_ctl() when a
child node needs to be added or removed from the set, not
epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close() for every
wait. But I suppose there is an argument that it's more complication
than it's worth.
Thank you for the comment.
At Thu, 5 Mar 2020 16:17:24 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:- v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
The async feature uses WaitEvent, and it needs to be released on
error. This patch makes it possible to register WaitEvent to
resowner to handle that case..
+1
- v2-0002-infrastructure-for-asynchronous-execution.patch
It povides an abstraction layer of asynchronous behavior
(execAsync). Then adds ExecAppend, another version of ExecAppend,
that handles "async-capable" subnodes asynchronously. Also it
contains planner part that makes planner aware of "async-capable"
and "async-aware" path nodes.
This patch add an infrastructure for asynchronous execution. As a PoC
this makes only Append capable to handle asynchronously executable
subnodes.
What other nodes do you think could be async aware? I suppose you
would teach joins to pass through the async support of their children,
and then you could make partition-wise join work like that.
An Append node is fed from many immediate-child async-capable nodes,
so the Append node can pick any child node that has fired.
Unfortunately joins are not wide but deep. That means a set of
async-capable nodes has multiple async-aware (and, for intermediate
nodes, async-capable at the same time) parent nodes. So if we want to
be async in that configuration, we need a "push-up" executor engine. In
my last trial, ignoring performance, I could turn almost all nodes into
push-up style, but a few nodes, like WindowAgg or HashJoin, have quite
a low affinity with push-up style since the call sites of child nodes
are many and scattered. I got through the low affinity by turning those
nodes into state machines, but I don't think that is good.
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
Cool. No extra cost for people not using the new feature.
It creates some duplicate code but I agree on the performance
perspective.
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
So, now when you execute a node, you get a result AND you get some
information that you access by reaching into the child node's
PlanState. The ExecProcNode() interface is extremely limiting, but
I'm not sure if this is the right way to extend it. Maybe
ExecAsyncProcNode() with a wide enough interface to do the job?
Sounds reasonable and seems easy to do.
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
...
+ if (refindsize < n)
...
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
...
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
This seems a bit strange. Why not just put the pointer in the plan
state? I suppose you want to avoid allocating a new buffer for every
query. Perhaps you could fix that by having a small fixed-size buffer
in the PlanState to cover common cases and allocating a larger one in
a per-query memory context if that one is too small, or just not
worrying about it and allocating every time since you know the desired
size.
The most significant factor for the shape would be that ExecAsync is not
a kind of ExecNode, so ExecAsyncEventWait doesn't have direct access to
EState other than through one of the given multiple nodes. I'll consider
trying to use the given ExecNodes as an access path to EState.
+ wes = CreateWaitEventSet(TopTransactionContext,
TopTransactionResourceOwner, n);
...
+ FreeWaitEventSet(wes);
BTW, just as an FYI, I am proposing[1] to add support for
RemoveWaitEvent(), so that you could have a single WaitEventSet for
the lifetime of the executor node, and then add and remove sockets
only as needed. I'm hoping to commit that for PG13, if there are no
objections or better ideas soon, because it's useful for some other
places where we currently create and destroy WaitEventSets frequently.
Yes! I have wanted that (but haven't done it myself..., and I didn't
understand the details from the title "Reducing WaitEventSet syscall
churn" :p)
One complication when working with long-lived WaitEventSet objects is
that libpq (or some other thing used by some other hypothetical
async-capable FDW) is free to close and reopen its socket whenever it
wants, so you need a way to know when it has done that. In that patch
set I added pqSocketChangeCount() so that you can see when pgSocket()
refers to a new socket (even if the file descriptor number is the same
by coincidence), but that imposes some book-keeping duties on the
caller, and now I'm wondering how that would look in your patch set.
As for postgres_fdw, an unexpected disconnection immediately leads to a
query ERROR.
My goal is to generate the minimum number of systems calls. I think
it would be nice if a 1000-shard query only calls epoll_ctl() when a
child node needs to be added or removed from the set, not
epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close() for every
wait. But I suppose there is an argument that it's more complication
than it's worth.
I'm not sure how much performance gain it gives, but reducing syscalls
is good in itself. I'll look into it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested
I have tested the feature and it shows great performance in queries
that return a small result set compared with the amount of data scanned.
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I ran into a strange issue when I executed 'make installcheck-world':
##########################################################
...
============== running regression test queries ==============
test adminpack ... FAILED 60 ms
======================
1 of 1 tests failed.
======================
The differences that caused some tests to fail can be viewed in the
file "/work/src/postgres_app_for/contrib/adminpack/regression.diffs". A copy of the test summary that you see
above is saved in the file "/work/src/postgres_app_for/contrib/adminpack/regression.out".
...
##########################################################
And the content in 'contrib/adminpack/regression.out' is:
##########################################################
SELECT pg_file_write('/tmp/test_file0', 'test0', false);
ERROR: absolute path not allowed
SELECT pg_file_write(current_setting('data_directory') || '/test_file4', 'test4', false);
- pg_file_write
----------------
- 5
-(1 row)
-
+ERROR: reference to parent directory ("..") not allowed
SELECT pg_file_write(current_setting('data_directory') || '/../test_file4', 'test4', false);
ERROR: reference to parent directory ("..") not allowed
RESET ROLE;
@@ -149,7 +145,7 @@
SELECT pg_file_unlink('test_file4');
pg_file_unlink
----------------
- t
+ f
(1 row)
##########################################################
However, the issue does not occur when I do a 'make check-world',
and it doesn't occur when I run 'make installcheck-world' without the patch.
The new status of this patch is: Waiting on Author
Hello. Thank you for testing.
At Tue, 10 Mar 2020 05:13:42 +0000, movead li <movead.li@highgo.ca> wrote in
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I occur a strange issue when a exec 'make installcheck-world', it is:
I hadn't done that, but it worked for me.
##########################################################
...
============== running regression test queries ==============
test adminpack ... FAILED 60 ms
======================
1 of 1 tests failed.
======================
The differences that caused some tests to fail can be viewed in the
file "/work/src/postgres_app_for/contrib/adminpack/regression.diffs". A copy of the test summary that you see
above is saved in the file "/work/src/postgres_app_for/contrib/adminpack/regression.out".
...
##########################################################
And the content in 'contrib/adminpack/regression.out' is:
I don't see that file. Maybe regression.diffs?
##########################################################
SELECT pg_file_write('/tmp/test_file0', 'test0', false);
ERROR: absolute path not allowed
SELECT pg_file_write(current_setting('data_directory') || '/test_file4', 'test4', false);
- pg_file_write
----------------
- 5
-(1 row)
-
+ERROR: reference to parent directory ("..") not allowed
It seems to me that you are setting a path containing ".." to PGDATA.
However the issue does not occur when I do a 'make check-world'.
And it doesn't occur when I test the 'make installcheck-world' without the patch.
check-world doesn't use a path containing ".." as PGDATA.
The new status of this patch is: Waiting on Author
Thanks for noticing that.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
It seems to me that you are setting a path containing ".." to PGDATA.
Thanks for pointing it out.
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I redid 'make installcheck-world' as Kyotaro Horiguchi pointed out, and
nothing was wrong with the result. I think the patch is good in both
feature and performance; here is the test result thread I made before:
/messages/by-id/CA+9bhCK7chd0qx+mny+U9xaOs2FDNJ7RaxG4=9rpgT6oAKBgWA@mail.gmail.com
The new status of this patch is: Ready for Committer
Hi,
On Wed, Mar 11, 2020 at 10:47 AM movead li <movead.li@highgo.ca> wrote:
I redo the make installcheck-world as Kyotaro Horiguchi point out and the
result nothing wrong. And I think the patch is good in feature and performance
here is the test result thread I made before:
/messages/by-id/CA+9bhCK7chd0qx+mny+U9xaOs2FDNJ7RaxG4=9rpgT6oAKBgWA@mail.gmail.com
The new status of this patch is: Ready for Committer
As discussed upthread, this is material for PG14, so I moved this to
the next commitfest, keeping the same status. I've not looked at the
patch in any detail yet, so I'm not sure that that is the right status
for the patch, though. I'd like to work on this for PG14 if I have
time.
Thanks!
Best regards,
Etsuro Fujita
On 3/30/20 1:15 PM, Etsuro Fujita wrote:
Hi,
This patch no longer applies cleanly.
In addition, code comments contain spelling errors.
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Hello, Andrey.
At Wed, 3 Jun 2020 15:00:06 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in
This patch no longer applies cleanly.
In addition, code comments contain spelling errors.
Sure. Thanks for noticing them, and sorry for the many typos.
Additional item in WaitEventIPC conflicted with this.
I found the following typos.
connection.c:
s/Rerturns/Returns/
postgres-fdw.c:
s/Retrive/Retrieve/
s/ForeginScanState/ForeignScanState/
s/manipuration/manipulation/
s/asyncstate/async state/
s/alrady/already/
nodeAppend.c:
s/Rery/Retry/
createplan.c:
s/chidlren/children/
resowner.c:
s/identier/identifier/ X 2
execnodes.h:
s/sutff/stuff/
plannodes.h:
s/asyncronous/asynchronous/
Removed a useless variable PgFdwScanState.result_ready.
Removed duplicate code from remove_async_node() by using move_to_next_waiter().
Done some minor cleanups.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v3-0001-Allow-wait-event-set-to-be-registered-to-resource.patchtext/x-patch; charset=us-asciiDownload
From db231fa99da5954b52e195f6af800c0f9b991ed4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v3 1/3] Allow wait event set to be registered to resource
owner
WaitEventSet needs to be released using resource owner for a certain
case. This change adds WaitEventSet reowner and allow the creator of a
WaitEventSet to specify a resource owner.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 18 ++++-
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 67 +++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 +++
6 files changed, 96 insertions(+), 5 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7717bb2719..16aefb03ee 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 05df5017c4..a8b52cd381 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -56,6 +56,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -84,6 +85,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -393,7 +396,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -560,12 +563,15 @@ ResetLatch(Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -680,6 +686,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -725,6 +736,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 37b6a4eecd..fcc92138fe 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -70,7 +70,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
{
WaitEventSet *new_event_set;
- new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+ new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
MyLatch, NULL);
AddWaitEventToSet(new_event_set, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..237ca9fa30 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
ResourceArray jitarr; /* JIT contexts */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
jit_release_context(context);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -725,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
Assert(owner->jitarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -752,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
ResourceArrayFree(&(owner->jitarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1370,3 +1385,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
elog(ERROR, "JIT context %p is not owned by resource owner %s",
DatumGetPointer(handle), owner->name);
}
+
+/*
+ * Make sure there is room for one more entry in the wait event set
+ * reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXX: There's no property usable as an identifier of a wait event set,
+ * so use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXX: There's no property usable as an identifier of a wait event set,
+ * so use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 46ae56cae3..b1b8375768 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
extern void SetLatch(Latch *latch);
extern void ResetLatch(Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
extern void ResourceOwnerForgetJIT(ResourceOwner owner,
Datum handle);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.18.2
v3-0002-infrastructure-for-asynchronous-execution.patch
From ced3307f27f01e657499ae6ef4436efaa5e350e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v3 2/3] infrastructure for asynchronous execution
This patch adds infrastructure for asynchronous execution. As a PoC,
it makes only Append capable of handling asynchronously executable
subnodes.
---
src/backend/commands/explain.c | 17 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execAsync.c | 152 +++++++++++
src/backend/executor/nodeAppend.c | 342 ++++++++++++++++++++----
src/backend/executor/nodeForeignscan.c | 21 ++
src/backend/nodes/bitmapset.c | 72 +++++
src/backend/nodes/copyfuncs.c | 3 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 3 +
src/backend/optimizer/plan/createplan.c | 66 ++++-
src/backend/postmaster/pgstat.c | 3 +
src/backend/postmaster/syslogger.c | 2 +-
src/backend/utils/adt/ruleutils.c | 8 +-
src/backend/utils/resowner/resowner.c | 4 +-
src/include/executor/execAsync.h | 22 ++
src/include/executor/executor.h | 1 +
src/include/executor/nodeForeignscan.h | 3 +
src/include/foreign/fdwapi.h | 11 +
src/include/nodes/bitmapset.h | 1 +
src/include/nodes/execnodes.h | 23 +-
src/include/nodes/plannodes.h | 9 +
src/include/pgstat.h | 3 +-
22 files changed, 705 insertions(+), 65 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index efd7201d61..708e9ed546 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -86,6 +86,7 @@ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
ExplainState *es);
static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1389,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1969,6 +1972,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Hash:
show_hash_info(castNode(HashState, planstate), es);
break;
+
+ case T_Append:
+ show_append_info(castNode(AppendState, planstate), es);
+ break;
+
default:
break;
}
@@ -2322,6 +2330,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ancestors, es);
}
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+ Append *plan = (Append *) astate->ps.plan;
+
+ if (plan->nasyncplans > 0)
+ ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
/*
* Show the grouping keys for an Agg node.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller is not reusing an existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit)
+{
+ switch (nodeTag(node))
+ {
+ case T_ForeignScanState:
+ return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+ wes, data, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(node));
+ }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+ int **p_refind;
+ int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+ /* arg is the address of the variable refind in ExecAsyncEventWait */
+ ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+ *mcbarg->p_refind = NULL;
+ *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
+ WaitEventSet *wes;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred = 0;
+ Bitmapset *fired_events = NULL;
+ int i;
+ int n;
+
+ n = bms_num_members(waitnodes);
+ wes = CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner, n);
+ if (refindsize < n)
+ {
+ if (refindsize == 0)
+ refindsize = EVENT_BUFFER_SIZE; /* XXX */
+ while (refindsize < n)
+ refindsize *= 2;
+ if (refind)
+ refind = (int *) repalloc(refind, refindsize * sizeof(int));
+ else
+ {
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+ MemoryContext oldctxt =
+ MemoryContextSwitchTo(TopTransactionContext);
+
+ /*
+ * refind points to a memory block in
+ * TopTransactionContext. Register a callback to reset it.
+ */
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+ refind = (int *) palloc(refindsize * sizeof(int));
+ MemoryContextSwitchTo(oldctxt);
+ }
+ }
+
+ /* Prepare WaitEventSet for waiting on the waitnodes. */
+ n = 0;
+ for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+ i = bms_next_member(waitnodes, i))
+ {
+ refind[i] = i;
+ if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+ n++;
+ }
+
+ /* Return immediately if there are no nodes to wait for. */
+ if (n == 0)
+ {
+ FreeWaitEventSet(wes);
+ return NULL;
+ }
+
+ noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+ EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+ FreeWaitEventSet(wes);
+ if (noccurred == 0)
+ return NULL;
+
+ for (i = 0 ; i < noccurred ; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+ {
+ int n = *(int*)w->user_data;
+
+ fired_events = bms_add_member(fired_events, n);
+ }
+ }
+
+ return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..60c36ee048 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
#include "miscadmin.h"
/* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
#define INVALID_SUBPLAN_INDEX -1
static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
PlanState **appendplanstates;
Bitmapset *validsubplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
/* check for unsupported flags */
- Assert(!(eflags & EXEC_FLAG_MARK));
+ Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
/*
* create new AppendState for our append node
*/
appendstate->ps.plan = (Plan *) node;
appendstate->ps.state = estate;
- appendstate->ps.ExecProcNode = ExecAppend;
/* Let choose_next_subplan_* function handle setting the first subplan */
- appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/*
* When no run-time pruning is required and there's at least one
- * subplan, we can fill as_valid_subplans immediately, preventing
+ * subplan, we can fill as_valid_syncsubplans immediately, preventing
* later calls to ExecFindMatchingSubPlans.
*/
if (!prunestate->do_exec_prune && nplans > 0)
- appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
}
else
{
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* subplans as valid; they must also all be initialized.
*/
Assert(nplans > 0);
- appendstate->as_valid_subplans = validsubplans =
- bms_add_range(NULL, 0, nplans - 1);
+ validsubplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
appendstate->as_prune_state = NULL;
}
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
*/
j = 0;
firstvalid = nplans;
+ nasyncplans = 0;
+
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ int sub_eflags = eflags;
+
+ /* Let async-capable subplans run asynchronously */
+ if (i < node->nasyncplans)
+ {
+ sub_eflags |= EXEC_FLAG_ASYNC;
+ nasyncplans++;
+ }
/*
* Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
if (i >= node->first_partial_plan && j < firstvalid)
firstvalid = j;
- appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+ appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
}
appendstate->as_first_partial_plan = firstvalid;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* fill in async stuff */
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_syncdone = (nasyncplans == nplans);
+ appendstate->as_exec_prune = false;
+
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+ if (appendstate->as_nasyncplans)
+ {
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async subplans need a request */
+ appendstate->as_needrequest =
+ bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+ /*
+ * ExecAppendAsync needs as_valid_syncsubplans to handle async
+ * subnodes.
+ */
+ if (appendstate->as_prune_state != NULL &&
+ appendstate->as_prune_state->do_exec_prune)
+ {
+ Assert(appendstate->as_valid_syncsubplans == NULL);
+
+ appendstate->as_exec_prune = true;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
- if (node->as_whichplan < 0)
+ if (node->as_whichsyncplan < 0)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
* If no subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+ if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
!node->choose_next_subplan(node))
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
+ Assert(node->as_nasyncplans == 0);
+
for (;;)
{
PlanState *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
/*
* figure out which subplan we are currently processing
*/
- Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
- subnode = node->appendplans[node->as_whichplan];
+ Assert(node->as_whichsyncplan >= 0 &&
+ node->as_whichsyncplan < node->as_nplans);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
}
}
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+ AppendState *node = castNode(AppendState, pstate);
+ Bitmapset *needrequest;
+ int i;
+
+ Assert(node->as_nasyncplans > 0);
+
+restart:
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (node->as_exec_prune)
+ {
+ Bitmapset *valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ /* Distribute valid subplans into sync and async */
+ node->as_needrequest =
+ bms_intersect(node->as_needrequest, valid_subplans);
+ node->as_valid_syncsubplans =
+ bms_difference(valid_subplans, node->as_needrequest);
+
+ node->as_exec_prune = false;
+ }
+
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ while ((i = bms_first_member(needrequest)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
+ {
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ }
+ }
+ else
+ node->as_pending_async = bms_add_member(node->as_pending_async, i);
+ }
+ bms_free(needrequest);
+
+ for (;;)
+ {
+ TupleTableSlot *result;
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ while (!bms_is_empty(node->as_pending_async))
+ {
+ /* Don't wait for async nodes if any sync node exists. */
+ long timeout = node->as_syncdone ? -1 : 0;
+ Bitmapset *fired;
+ int i;
+
+ fired = ExecAsyncEventWait(node->appendplans,
+ node->as_pending_async,
+ timeout);
+
+ if (bms_is_empty(fired) && node->as_syncdone)
+ {
+ /*
+ * We get here if all the pending subnodes had already fired
+ * before we waited. Retry fetching from those nodes.
+ */
+ node->as_needrequest = node->as_pending_async;
+ node->as_pending_async = NULL;
+ goto restart;
+ }
+
+ while ((i = bms_first_member(fired)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+ slot = ExecProcNode(subnode);
+
+ Assert(subnode->asyncstate == AS_AVAILABLE);
+
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, i);
+ }
+
+ node->as_pending_async =
+ bms_del_member(node->as_pending_async, i);
+ }
+ bms_free(fired);
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done scanning
+ * this node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the synchronous children.
+ */
+
+ if (!node->as_syncdone &&
+ node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+ node->as_syncdone = !node->choose_next_subplan(node);
+
+ if (node->as_syncdone)
+ {
+ Assert(bms_is_empty(node->as_pending_async));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
+ /*
+ * get a tuple from the subplan
+ */
+ result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+ if (!TupIsNull(result))
+ {
+ /*
+ * If the subplan gave us something then return it as-is. We do
+ * NOT make use of the result slot that was set up in
+ * ExecInitAppend; there's no need for it.
+ */
+ return result;
+ }
+
+ /*
+ * Go on to the "next" subplan. If no more subplans, return the empty
+ * slot set up for us by ExecInitAppend, unless there are async plans
+ * we have yet to finish.
+ */
+ if (!node->choose_next_subplan(node))
+ {
+ node->as_syncdone = true;
+ if (bms_is_empty(node->as_pending_async))
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /* Else loop back and try to get a tuple from the new subplan */
+ }
+}
+
/* ----------------------------------------------------------------
* ExecEndAppend
*
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
bms_overlap(node->ps.chgParam,
node->as_prune_state->execparamids))
{
- bms_free(node->as_valid_subplans);
- node->as_valid_subplans = NULL;
+ bms_free(node->as_valid_syncsubplans);
+ node->as_valid_syncsubplans = NULL;
}
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ ExecShutdownNode(node->appendplans[i]);
+
+ node->as_nasyncresult = 0;
+ node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
}
/* Let choose_next_subplan_* function handle setting the first subplan */
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
}
/* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
static bool
choose_next_subplan_locally(AppendState *node)
{
- int whichplan = node->as_whichplan;
+ int whichplan = node->as_whichsyncplan;
int nextplan;
/* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
{
- if (node->as_valid_subplans == NULL)
- node->as_valid_subplans =
+ /* Shouldn't have an active async node */
+ Assert(bms_is_empty(node->as_needrequest));
+
+ if (node->as_valid_syncsubplans == NULL)
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
+ /* Exclude async plans */
+ if (node->as_nasyncplans > 0)
+ bms_del_range(node->as_valid_syncsubplans,
+ 0, node->as_nasyncplans - 1);
+
whichplan = -1;
}
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
Assert(whichplan >= -1 && whichplan <= node->as_nplans);
if (ScanDirectionIsForward(node->ps.state->es_direction))
- nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
else
- nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
if (nextplan < 0)
return false;
- node->as_whichplan = nextplan;
+ node->as_whichsyncplan = nextplan;
return true;
}
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
{
/* Mark just-completed subplan as finished. */
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
}
else
{
/* Start with last subplan. */
- node->as_whichplan = node->as_nplans - 1;
+ node->as_whichsyncplan = node->as_nplans - 1;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be
* set to all subplans.
*/
- if (node->as_valid_subplans == NULL)
+ if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
/*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
}
/* Loop until we find a subplan to execute. */
- while (pstate->pa_finished[node->as_whichplan])
+ while (pstate->pa_finished[node->as_whichsyncplan])
{
- if (node->as_whichplan == 0)
+ if (node->as_whichsyncplan == 0)
{
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
LWLockRelease(&pstate->pa_lock);
return false;
}
/*
- * We needn't pay attention to as_valid_subplans here as all invalid
+ * We needn't pay attention to as_valid_syncsubplans here as all invalid
* plans have been marked as finished.
*/
- node->as_whichplan--;
+ node->as_whichsyncplan--;
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
/* Mark just-completed subplan as finished. */
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be set
* to all subplans.
*/
- else if (node->as_valid_subplans == NULL)
+ else if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
mark_invalid_subplans_as_finished(node);
}
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Save the plan from which we are starting the search. */
- node->as_whichplan = pstate->pa_next_plan;
+ node->as_whichsyncplan = pstate->pa_next_plan;
/* Loop until we find a valid subplan to execute. */
while (pstate->pa_finished[pstate->pa_next_plan])
{
int nextplan;
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
if (nextplan >= 0)
{
/* Advance to the next valid plan. */
pstate->pa_next_plan = nextplan;
}
- else if (node->as_whichplan > node->as_first_partial_plan)
+ else if (node->as_whichsyncplan > node->as_first_partial_plan)
{
/*
* Try looping back to the first valid partial plan, if there is
* one. If there isn't, arrange to bail out below.
*/
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
pstate->pa_next_plan =
- nextplan < 0 ? node->as_whichplan : nextplan;
+ nextplan < 0 ? node->as_whichsyncplan : nextplan;
}
else
{
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
* At last plan, and either there are no partial plans or we've
* tried them all. Arrange to bail out.
*/
- pstate->pa_next_plan = node->as_whichplan;
+ pstate->pa_next_plan = node->as_whichsyncplan;
}
- if (pstate->pa_next_plan == node->as_whichplan)
+ if (pstate->pa_next_plan == node->as_whichsyncplan)
{
/* We've tried everything! */
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Pick the plan we found, and advance pa_next_plan one more time. */
- node->as_whichplan = pstate->pa_next_plan;
- pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+ node->as_whichsyncplan = pstate->pa_next_plan;
+ pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
/*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
*/
if (pstate->pa_next_plan < 0)
{
- int nextplan = bms_next_member(node->as_valid_subplans,
+ int nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
Assert(node->as_prune_state);
/* Nothing to do if all plans are valid */
- if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+ if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
return;
/* Mark all non-valid plans as finished */
for (i = 0; i < node->as_nplans; i++)
{
- if (!bms_is_member(i, node->as_valid_subplans))
+ if (!bms_is_member(i, node->as_valid_syncsubplans))
node->as_pstate->pa_finished[i] = true;
}
}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+ scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+ if ((eflags & EXEC_FLAG_ASYNC) != 0)
+ scanstate->fs_async = true;
/*
* Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecForeignAsyncConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+ caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
return a;
}
+/*
+ * bms_del_range
+ * Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+ int lwordnum,
+ lbitnum,
+ uwordnum,
+ ushiftbits,
+ wordnum;
+
+ if (lower < 0 || upper < 0)
+ elog(ERROR, "negative bitmapset member not allowed");
+ if (lower > upper)
+ elog(ERROR, "lower range must not be above upper range");
+ uwordnum = WORDNUM(upper);
+
+ if (a == NULL)
+ {
+ a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ }
+
+ /* ensure we have enough words to store the upper bit */
+ else if (uwordnum >= a->nwords)
+ {
+ int oldnwords = a->nwords;
+ int i;
+
+ a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ /* zero out the enlarged portion */
+ for (i = oldnwords; i < a->nwords; i++)
+ a->words[i] = 0;
+ }
+
+ wordnum = lwordnum = WORDNUM(lower);
+
+ lbitnum = BITNUM(lower);
+ ushiftbits = BITNUM(upper) + 1;
+
+ /*
+ * Special case: when lwordnum is the same as uwordnum, we must perform
+ * both the upper and lower masking on the same word.
+ */
+ if (lwordnum == uwordnum)
+ {
+ a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+ | (~(bitmapword) 0) << ushiftbits);
+ }
+ else
+ {
+ /* turn off lbitnum and all bits left of it */
+ a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+ /* turn off all bits for any intermediate words */
+ while (wordnum < uwordnum)
+ a->words[wordnum++] = (bitmapword) 0;
+
+ /* turn off upper's bit and all bits right of it. */
+ a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+ }
+
+ return a;
+}
+
/*
* bms_int_members - like bms_intersect, but left input is recycled
*/
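The single-word branch of bms_del_range builds a keep-mask from the bits below 'lower' OR'd with the bits above 'upper'. A standalone sketch of that masking, with hypothetical names (`clear_range_in_word` is not part of the patch) and 64-bit words assumed:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BITS_PER_WORD 64
#define DEMO_BITNUM(x) ((x) % DEMO_BITS_PER_WORD)

/*
 * Clear bits [lower, upper] inside one 64-bit word, mirroring the
 * lwordnum == uwordnum branch of bms_del_range: keep the bits below
 * 'lower' together with the bits above 'upper'.  (Hypothetical helper,
 * not part of the patch.)
 */
static uint64_t
clear_range_in_word(uint64_t word, int lower, int upper)
{
	uint64_t	keep_low = ((uint64_t) 1 << DEMO_BITNUM(lower)) - 1;
	uint64_t	keep_high;

	/* guard: shifting a 64-bit value by 64 is undefined behavior in C */
	if (DEMO_BITNUM(upper) == DEMO_BITS_PER_WORD - 1)
		keep_high = 0;
	else
		keep_high = (~(uint64_t) 0) << (DEMO_BITNUM(upper) + 1);

	return word & (keep_low | keep_high);
}
```

For example, clearing bits 4..7 of an all-ones word leaves only the low nibble and bits 8 and up set.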
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8cf87e6d0..89a49e2fdc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..d4bb44b268 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..63af7c02d8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1572,6 +1572,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1672,6 +1673,8 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 744eed187d..ba18dd88a8 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -300,6 +300,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
/*
@@ -1082,6 +1083,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
bool tlist_was_changed = false;
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
+ List *asyncpaths = NIL;
+ List *syncpaths = NIL;
+ List *newsubpaths = NIL;
ListCell *subpaths;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
@@ -1090,6 +1096,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1219,9 +1228,36 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
}
- subplans = lappend(subplans, subplan);
+ /*
+ * Classify as async-capable or not. If we have decided to run the
+ * children in parallel, we cannot run any of them asynchronously.
+ */
+ if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ asyncplans = lappend(asyncplans, subplan);
+ asyncpaths = lappend(asyncpaths, subpath);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ {
+ syncplans = lappend(syncplans, subplan);
+ syncpaths = lappend(syncpaths, subpath);
+ }
+
+ first = false;
}
+ /*
+ * subplans contains the async plans in the first half, if any, followed
+ * by the sync plans. Order subpaths the same way so that the partition
+ * pruning information stays in sync with subplans.
+ */
+ subplans = list_concat(asyncplans, syncplans);
+ newsubpaths = list_concat(asyncpaths, syncpaths);
+
/*
* If any quals exist, they may be useful to perform further partition
* pruning during execution. Gather information needed by the executor to
@@ -1249,7 +1285,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
if (prunequal != NIL)
partpruneinfo =
make_partition_pruneinfo(root, rel,
- best_path->subpaths,
+ newsubpaths,
best_path->partitioned_rels,
prunequal);
}
@@ -1257,6 +1293,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
plan->appendplans = subplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
+ plan->nasyncplans = nasyncplans;
+ plan->referent = referent_is_sync ? nasyncplans : 0;
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -7016,3 +7054,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
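is_async_capable_path relies on the usual FdwRoutine convention for optional callbacks: a wrapper leaves the pointer NULL when it does not support the feature. A minimal sketch of that convention, using hypothetical stand-ins (`DemoPath`, `always_yes` are not part of the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a path node carrying an optional callback. */
typedef struct DemoPath DemoPath;
typedef bool (*AsyncCapableFn) (DemoPath *path);

struct DemoPath
{
	AsyncCapableFn is_async_capable;	/* may be NULL */
};

/* Only claim async capability when the callback exists and answers true. */
static bool
path_is_async_capable(DemoPath *path)
{
	return path->is_async_capable != NULL && path->is_async_capable(path);
}

static bool
always_yes(DemoPath *path)
{
	(void) path;
	return true;
}

/* A path with no callback is treated as not async-capable. */
static bool
demo_null_callback(void)
{
	DemoPath	p = {NULL};

	return path_is_async_capable(&p);
}

/* A path whose callback answers true is async-capable. */
static bool
demo_set_callback(void)
{
	DemoPath	p = {always_yes};

	return path_is_async_capable(&p);
}
```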
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..79a2562454 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3882,6 +3882,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index ffcb54968f..a4de6d90e2 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -300,7 +300,7 @@ SysLoggerMain(int argc, char *argv[])
* syslog pipe, which implies that all other backends have exited
* (including the postmaster).
*/
- wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+ wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
#ifndef WIN32
AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 076c3c019f..f7b5587d7f 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4584,10 +4584,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
* tlists according to one of the children, and the first one is the most
* natural choice. Likewise special-case ModifyTable to pretend that the
* first child plan is the OUTER referent; this is to support RETURNING
- * lists containing references to non-target relations.
+ * lists containing references to non-target relations. For Append, use the
+ * explicitly specified referent.
*/
if (IsA(plan, Append))
- dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+ {
+ Append *app = (Append *) plan;
+ dpns->outer_plan = list_nth(app->appendplans, app->referent);
+ }
else if (IsA(plan, MergeAppend))
dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
else if (IsA(plan, ModifyTable))
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 237ca9fa30..27742a1641 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1416,7 +1416,7 @@ void
ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
{
/*
- * XXXX: There's no property to show as an identier of a wait event set,
+ * XXXX: There's no property to show as an identifier of a wait event set,
* use its pointer instead.
*/
if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
@@ -1431,7 +1431,7 @@ static void
PrintWESLeakWarning(WaitEventSet *events)
{
/*
- * XXXX: There's no property to show as an identier of a wait event set,
+ * XXXX: There's no property to show as an identifier of a wait event set,
* use its pointer instead.
*/
elog(WARNING, "wait event set leak: %p still referenced",
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+ long timeout);
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c7deeac662..aca9e2bddd 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
#define EXEC_FLAG_MARK 0x0008 /* need mark/restore */
#define EXEC_FLAG_SKIP_TRIGGERS 0x0010 /* skip AfterTrigger calls */
#define EXEC_FLAG_WITH_NO_DATA 0x0020 /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC 0x0040 /* request async execution */
/* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data,
+ bool reinit);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
GetForeignPlan_function GetForeignPlan;
BeginForeignScan_function BeginForeignScan;
IterateForeignScan_function IterateForeignScan;
+ IterateForeignScan_function IterateForeignScanAsync;
ReScanForeignScan_function ReScanForeignScan;
EndForeignScan_function EndForeignScan;
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
InitializeDSMForeignScan_function InitializeDSMForeignScan;
ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
ShutdownForeignScan_function ShutdownForeignScan;
/* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
/* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 98e0072b8a..cd50494c74 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -938,6 +938,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
* abstract superclass for all PlanState-type nodes.
* ----------------
*/
+typedef enum AsyncState
+{
+ AS_AVAILABLE,
+ AS_WAITING
+} AsyncState;
+
typedef struct PlanState
{
NodeTag type;
@@ -1026,6 +1032,11 @@ typedef struct PlanState
bool outeropsset;
bool inneropsset;
bool resultopsset;
+
+ /* Async subnode execution stuff */
+ AsyncState asyncstate;
+
+ int32 padding; /* to keep alignment of derived types */
} PlanState;
/* ----------------
@@ -1221,14 +1232,21 @@ struct AppendState
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
- int as_whichplan;
+ int as_whichsyncplan; /* which sync plan is being executed */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
+ int as_nasyncplans; /* # of async-capable children */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
- Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_syncsubplans;
bool (*choose_next_subplan) (AppendState *);
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ Bitmapset *as_pending_async; /* pending async plans */
+ TupleTableSlot **as_asyncresult; /* results of each async plan */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ bool as_exec_prune; /* runtime pruning needed for async exec? */
};
/* ----------------
@@ -1796,6 +1814,7 @@ typedef struct ForeignScanState
Size pscan_len; /* size of parallel coordination information */
/* use struct pointer to avoid including fdwapi.h here */
struct FdwRoutine *fdwroutine;
+ bool fs_async;
void *fdw_state; /* foreign-data wrapper can keep state here */
} ForeignScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..abad89b327 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous execution logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -262,6 +267,10 @@ typedef struct Append
/* Info for run-time subplan pruning; NULL if we're not doing that */
struct PartitionPruneInfo *part_prune_info;
+
+ /* Async child node execution stuff */
+ int nasyncplans; /* # async subplans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..2259910637 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,7 +887,8 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
- WAIT_EVENT_XACT_GROUP_UPDATE
+ WAIT_EVENT_XACT_GROUP_UPDATE,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.18.2
Attachment: v3-0003-async-postgres_fdw.patch (text/x-patch; charset=us-ascii)
From 4c207a7901f0a9d05aacb5ce46a7f1daa83ce474 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v3 3/3] async postgres_fdw
---
contrib/postgres_fdw/connection.c | 28 +
.../postgres_fdw/expected/postgres_fdw.out | 222 ++++---
contrib/postgres_fdw/postgres_fdw.c | 603 +++++++++++++++---
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +-
5 files changed, 694 insertions(+), 181 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 52d1fe3563..d9edc5e4de 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
bool invalidated; /* true if reconnect is pending */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
entry->conn, server->servername, user->umid, user->userid);
+ entry->storage = NULL;
}
/*
@@ -215,6 +217,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
return entry->conn;
}
+/*
+ * Return the connection-specific storage for this user mapping. Allocate
+ * initsize zeroed bytes for it first if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ bool found;
+ ConnCacheEntry *entry;
+ ConnCacheKey key;
+
+ /* Find the storage using the same key as GetConnection */
+ key = user->umid;
+ entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+ Assert(found);
+
+ /* Create it on first use. */
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
/*
* Connect to remote server using specified server and user mapping properties.
*/
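GetConnectionSpecificStorage follows an allocate-once pattern: the first caller gets a zeroed block of initsize bytes, and every later caller with the same key gets the same pointer back. A minimal sketch of the pattern, with `DemoEntry` and plain malloc standing in for ConnCacheEntry and CacheMemoryContext allocation (hypothetical names, not part of the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ConnCacheEntry. */
typedef struct DemoEntry
{
	void	   *storage;		/* NULL until first request */
} DemoEntry;

/* Allocate zeroed storage on the first call, return it unchanged after. */
static void *
get_entry_storage(DemoEntry *entry, size_t initsize)
{
	if (entry->storage == NULL)
	{
		entry->storage = malloc(initsize);
		memset(entry->storage, 0, initsize);
	}
	return entry->storage;
}

/* The pointer is non-NULL and stable across repeated requests. */
static bool
demo_storage_is_stable(void)
{
	DemoEntry	e = {NULL};
	void	   *first = get_entry_storage(&e, 16);
	bool		ok = (first != NULL && first == get_entry_storage(&e, 16));

	free(first);
	return ok;
}
```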
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 82fc1290ef..29aa09db8e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6973,7 +6973,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7001,7 +7001,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7029,7 +7029,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7057,7 +7057,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7127,35 +7127,41 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7165,35 +7171,41 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7223,11 +7235,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-> Hash Join
Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
@@ -7241,12 +7254,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
select tableoid::regclass, * from bar order by 1,2;
@@ -7276,16 +7290,17 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
+ Output: ROW(foo_1.f1), foo_1.f1
+ Remote SQL: SELECT f1 FROM public.loct1
+ -> Async Foreign Scan on public.foo2 foo_3
+ Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+ Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
- Output: ROW(foo_1.f1), foo_1.f1
- Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
- Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
- Remote SQL: SELECT f1 FROM public.loct1
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -7303,17 +7318,18 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
+ Output: ROW(foo_1.f1), foo_1.f1
+ Remote SQL: SELECT f1 FROM public.loct1
+ -> Async Foreign Scan on public.foo2 foo_3
+ Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+ Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
- Output: ROW(foo_1.f1), foo_1.f1
- Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
- Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
- Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+(47 rows)
update bar set f2 = f2 + 100
from
@@ -7463,27 +7479,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2 bar_1
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2 bar_1
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2 bar_1
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2 bar_1
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
@@ -8558,11 +8580,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a % 25 =0 ORDER BY 1,2,3;
a | b | c
@@ -8597,20 +8620,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
wr | wr
@@ -8639,11 +8664,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
a | b
@@ -8696,21 +8722,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
-- test FOR UPDATE; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
- QUERY PLAN
---------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------
LockRows
-> Sort
Sort Key: t1.a
-> Hash Join
Hash Cond: (t2.b = t1.a)
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
-> Hash
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
a | b
@@ -8745,18 +8773,19 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+ Async subplans: 3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
-- Plan with partitionwise aggregates is enabled
SET enable_partitionwise_aggregate TO true;
@@ -8767,13 +8796,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ Async subplans: 3
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
a | sum | min | count
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9fc53cad68..b04b6a0e54 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -35,6 +37,7 @@
#include "optimizer/restrictinfo.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "postgres_fdw.h"
#include "utils/builtins.h"
#include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -122,11 +128,29 @@ enum FdwDirectModifyPrivateIndex
FdwDirectModifyPrivateSetProcessed
};
+/*
+ * Connection common state - shared among all PgFdwState instances using the
+ * same connection.
+ */
+typedef struct PgFdwConnCommonState
+{
+ ForeignScanState *leader; /* leader node of this connection */
+ bool busy; /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnCommonState *commonstate; /* connection common state */
+} PgFdwState;
+
/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +161,6 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool async; /* true if run asynchronously */
+ bool queued; /* true if this node is in waiter queue */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* last element in waiter queue.
+ * valid only on the leader node */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
/*
* Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwstate, bool clear_queue);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
PG_RETURN_POINTER(routine);
}
@@ -1434,12 +1476,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+ fsstate->s.commonstate->leader = NULL;
+ fsstate->s.commonstate->busy = false;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->async = false;
+ fsstate->queued = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1487,40 +1539,244 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_values);
}
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue the node if it isn't already in the queue. If the underlying
+ * connection is not busy, send the request immediately instead.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+ /*
+ * Do nothing if the node is already in the queue or already eof'ed.
+ * Note: leader node is not marked as queued.
+ */
+ if (leader == node || fsstate->queued || fsstate->eof_reached)
+ return;
+
+ if (leader == NULL)
+ {
+ /* no leader means not busy, send request immediately */
+ request_more_data(node);
+ }
+ else
+ {
+ /* the connection is busy, queue the node */
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+ PgFdwScanState *last_waiter_state
+ = GetPgFdwScanState(leader_state->last_waiter);
+
+ last_waiter_state->waiter = node;
+ leader_state->last_waiter = node;
+ fsstate->queued = true;
+ }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Make the first waiter the next leader.
+ * Returns the new leader or NULL if there's no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *leader_state = GetPgFdwScanState(node);
+ ForeignScanState *next_leader = leader_state->waiter;
+
+ Assert(leader_state->s.commonstate->leader == node);
+
+ if (next_leader)
+ {
+ /* the first waiter becomes the next leader */
+ PgFdwScanState *next_leader_state = GetPgFdwScanState(next_leader);
+ next_leader_state->last_waiter = leader_state->last_waiter;
+ next_leader_state->queued = false;
+ }
+
+ leader_state->waiter = NULL;
+ leader_state->s.commonstate->leader = next_leader;
+
+ return next_leader;
+}
+
+/*
+ * Remove the node from waiter queue.
+ *
+ * If the node is a busy leader, any remaining results are absorbed first.
+ * This is intended to be used during node shutdown.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state;
+ ForeignScanState *prev;
+ PgFdwScanState *prev_state;
+ ForeignScanState *cur;
+
+ /* no need to remove me */
+ if (!leader || !fsstate->queued)
+ return;
+
+ leader_state = GetPgFdwScanState(leader);
+
+ if (leader == node)
+ {
+ /* It's the leader */
+ ForeignScanState *next_leader;
+
+ if (leader_state->s.commonstate->busy)
+ {
+ /*
+ * this node is waiting for result, absorb the result first so
+ * that the following commands can be sent on the connection.
+ */
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+ PGconn *conn = leader_state->s.conn;
+
+ while (PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+
+ leader_state->s.commonstate->busy = false;
+ }
+
+ move_to_next_waiter(node);
+
+ return;
+ }
+
+ /*
+ * Just remove the node from the queue
+ *
+ * Nodes don't have a link to the previous node, but this function is called
+ * only on the shutdown path, so we don't bother looking for a faster way
+ * to do this.
+ */
+ prev = leader;
+ prev_state = leader_state;
+ cur = GetPgFdwScanState(prev)->waiter;
+ while (cur)
+ {
+ PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+ if (cur == node)
+ {
+ prev_state->waiter = curstate->waiter;
+
+ /* relink to the previous node if the last node was removed */
+ if (leader_state->last_waiter == cur)
+ leader_state->last_waiter = prev;
+
+ fsstate->queued = false;
+
+ return;
+ }
+ prev = cur;
+ prev_state = curstate;
+ cur = curstate->waiter;
+ }
+}
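The linked-list bookkeeping in these three functions is compact but easy to misread, since `last_waiter` is only meaningful on the leader and the leader itself is never marked `queued`. As a cross-check, here is a small Python model of the leader/waiter queue; the field and function names mirror the patch, but this is an illustration of the invariants, not the patch's code:

```python
class Node:
    """Models the queue-related fields of PgFdwScanState."""
    def __init__(self, name):
        self.name = name
        self.waiter = None       # next node sharing the same connection
        self.last_waiter = self  # tail of the queue; valid only on the leader
        self.queued = False      # True while sitting in the waiter queue


class Conn:
    """Models PgFdwConnCommonState (one instance per shared connection)."""
    def __init__(self):
        self.leader = None


def add_waiter(conn, node):
    """Sketch of add_async_waiter(): the leader itself is never queued."""
    if conn.leader is node or node.queued:
        return
    if conn.leader is None:
        conn.leader = node  # stands in for request_more_data(node)
    else:
        conn.leader.last_waiter.waiter = node
        conn.leader.last_waiter = node
        node.queued = True


def move_to_next_waiter(conn):
    """Sketch of move_to_next_waiter(): promote the first waiter."""
    leader = conn.leader
    nxt = leader.waiter
    if nxt is not None:
        nxt.last_waiter = leader.last_waiter
        nxt.queued = False
    leader.waiter = None
    conn.leader = nxt
    return nxt


def remove_node(conn, node):
    """Sketch of remove_async_node(): a linear scan is fine on shutdown."""
    if conn.leader is node:
        move_to_next_waiter(conn)
        return
    if conn.leader is None or not node.queued:
        return
    prev, cur = conn.leader, conn.leader.waiter
    while cur is not None:
        if cur is node:
            prev.waiter = cur.waiter
            if conn.leader.last_waiter is cur:
                conn.leader.last_waiter = prev  # removed the tail; relink
            cur.queued = False
            return
        prev, cur = cur, cur.waiter
```

Running this model with four nodes reproduces the expected relinking when a middle or tail waiter is removed and when the leader hands off to the first waiter.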
+
/*
* postgresIterateForeignScan
- * Retrieve next row from the result set, or clear tuple slot to indicate
- * EOF.
+ * Retrieve next row from the result set.
+ *
+ * For synchronous nodes, returning an empty tuple slot means EOF.
+ *
+ * For asynchronous nodes, if an empty tuple slot is returned, the caller
+ * needs to check the async state to tell whether all tuples have been
+ * received (AS_AVAILABLE) or more data is expected (AS_WAITING).
*/
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- /*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
- * Get some more tuples, if we've run out.
- */
+ if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+ {
+ /* we've run out, get some more tuples */
+ if (!node->fs_async)
+ {
+ /*
+ * finish the running query before sending the next command for
+ * this node
+ */
+ if (!fsstate->s.commonstate->busy)
+ vacate_connection((PgFdwState *)fsstate, false);
+
+ request_more_data(node);
+
+ /* Fetch the result immediately. */
+ fetch_received_data(node);
+ }
+ else if (!fsstate->s.commonstate->busy)
+ {
+ /* If the connection is not busy, just send the request. */
+ request_more_data(node);
+ }
+ else
+ {
+ /* The connection is busy, queue the request */
+ bool available = true;
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+ /* queue the requested node */
+ add_async_waiter(node);
+
+ /*
+ * The request for the next node cannot be sent before the leader
+ * responds. Finish the current leader if possible.
+ */
+ if (PQisBusy(leader_state->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT |
+ WL_EXIT_ON_PM_DEATH,
+ PQsocket(leader_state->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (!(rc & WL_SOCKET_READABLE))
+ available = false;
+ }
+
+ /* fetch the leader's data and enqueue it for the next request */
+ if (available)
+ {
+ fetch_received_data(leader);
+ add_async_waiter(leader);
+ }
+ }
+ }
+
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
- if (fsstate->next_tuple >= fsstate->num_tuples)
- return ExecClearTuple(slot);
+ /*
+ * We haven't received a result for the given node this time; return
+ * with no tuple to give way to another node.
+ */
+ if (fsstate->eof_reached)
+ node->ss.ps.asyncstate = AS_AVAILABLE;
+ else
+ node->ss.ps.asyncstate = AS_WAITING;
+
+ return ExecClearTuple(slot);
}
/*
* Return the next tuple.
*/
+ node->ss.ps.asyncstate = AS_AVAILABLE;
ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
false);
@@ -1535,7 +1791,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1543,6 +1799,8 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ vacate_connection((PgFdwState *)fsstate, true);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1571,9 +1829,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1591,7 +1849,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1599,15 +1857,31 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
+/*
+ * postgresShutdownForeignScan
+ * Remove asynchrony state and clean up any leftover results on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* remove the node from waiting queue */
+ remove_async_node(node);
+}
+
/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
@@ -2372,7 +2646,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2457,7 +2733,11 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ /* finish running query to send my command */
+ vacate_connection((PgFdwState *)dmstate, true);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2504,8 +2784,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2703,6 +2983,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnCommonState *commonstate;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2747,6 +3028,18 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ commonstate = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnCommonState));
+ if (commonstate)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.commonstate = commonstate;
+
+ /* finish running query to send my command */
+ vacate_connection(&tmpstate, true);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3317,11 +3610,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -3384,50 +3677,120 @@ create_cursor(ForeignScanState *node)
}
/*
- * Fetch some more rows from the node's cursor.
+ * Send the next fetch request for the node. If the given node is not the
+ * current connection leader, the old leader is pushed onto the waiter queue
+ * and the given node becomes the leader.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* must be non-busy */
+ Assert(!fsstate->s.commonstate->busy);
+ /* must be not-eof'ed */
+ Assert(!fsstate->eof_reached);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.commonstate->busy = true;
+
+ /* The node is the current leader, just return. */
+ if (leader == node)
+ return;
+
+ /* Let the node be the leader */
+ if (leader != NULL)
+ {
+ remove_async_node(node);
+ fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+ fsstate->waiter = leader;
+ }
+ else
+ {
+ fsstate->last_waiter = node;
+ fsstate->waiter = NULL;
+ }
+
+ fsstate->s.commonstate->leader = node;
+}
+
+/*
+ * Fetch the received data and automatically send a request for the next waiter.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ ForeignScanState *waiter;
+
+ /* I should be the current connection leader */
+ Assert(fsstate->s.commonstate->leader == node);
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* Some tuples remain unread; move them to the beginning of the store */
+ int n = 0;
+
+ while (fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
-
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
+ res = pgfdw_get_result(conn, fsstate->query);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3437,22 +3800,73 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
}
PG_FINALLY();
{
+ fsstate->s.commonstate->busy = false;
+
if (res)
PQclear(res);
}
PG_END_TRY();
+ /* let the first waiter be the next leader of this connection */
+ waiter = move_to_next_waiter(node);
+
+ /* send the next request if any */
+ if (waiter)
+ request_more_data(waiter);
+
MemoryContextSwitchTo(oldcontext);
}
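Unlike the old fetch_more_data(), which always flushed the previous batch, fetch_received_data() may be called while unread tuples are still buffered (for example, when another node vacates the shared connection), so it compacts the batch instead of discarding it. A hedged Python sketch of that buffer discipline, with invented names standing in for the C fields:

```python
def absorb_batch(store, new_rows, fetch_size):
    """Sketch of fetch_received_data()'s tuple-store handling.

    `store` is a dict with "tuples" (list) and "next_tuple" (read cursor).
    Returns True when EOF is inferred: fewer rows arrived than requested.
    """
    if store["next_tuple"] >= len(store["tuples"]):
        # Previous batch fully consumed: flush it (MemoryContextReset in C).
        store["tuples"] = []
    elif store["next_tuple"] > 0:
        # Unread rows remain: compact them to the front of the store.
        store["tuples"] = store["tuples"][store["next_tuple"]:]
    # Append the newly received rows (repalloc + copy in the C code).
    store["tuples"].extend(new_rows)
    store["next_tuple"] = 0
    return len(new_rows) < fetch_size
```

The key property is that a half-read batch survives a drain: rows 3 and 4 of a four-row batch read up to position 2 are still there, ahead of the newly fetched rows.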
+/*
+ * Vacate the underlying connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+ PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+ ForeignScanState *leader;
+
+ Assert(commonstate != NULL);
+
+ /* just return if the connection is already available */
+ if (commonstate->leader == NULL || !commonstate->busy)
+ return;
+
+ /*
+ * Let the current connection leader absorb all of the results of the
+ * running query.
+ */
+ leader = commonstate->leader;
+ fetch_received_data(leader);
+
+ /* let the first waiter be the next leader of this connection */
+ move_to_next_waiter(leader);
+
+ if (!clear_queue)
+ return;
+
+ /* Clear the waiting list */
+ while (leader)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+ fsstate->last_waiter = NULL;
+ leader = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
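The drain-then-promote sequence above can be modeled in a few lines of Python. This is a deliberate simplification: draining is reduced to one callback (in the C code, fetch_received_data() both absorbs the result and clears the busy flag), and the queue is unlinked starting from the promoted waiter; the names are invented for illustration:

```python
def vacate(conn, clear_queue, drain):
    """Simplified sketch of vacate_connection().

    `conn` is a dict with "leader" and "busy"; each node is a dict with
    "waiter" and "last_waiter". `drain` stands in for fetch_received_data().
    """
    if conn["leader"] is None or not conn["busy"]:
        return  # connection already available
    leader = conn["leader"]
    drain(leader)            # absorb the pending result for the leader
    conn["busy"] = False
    nxt = leader["waiter"]   # promote the first waiter, if any
    leader["waiter"] = None
    conn["leader"] = nxt
    if not clear_queue:
        return
    node = nxt
    while node is not None:  # unlink every still-queued node
        nxt_node = node["waiter"]
        node["last_waiter"] = None
        node["waiter"] = None
        node = nxt_node
```

With clear_queue set, the connection ends up non-busy with a promoted leader whose queue links have been severed, which is what callers like postgresReScanForeignScan and execute_foreign_modify rely on before issuing their own command.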
+
/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
@@ -3566,7 +3980,9 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -3653,6 +4069,9 @@ execute_foreign_modify(EState *estate,
operation == CMD_UPDATE ||
operation == CMD_DELETE);
+ /* finish running query to send my command */
+ vacate_connection((PgFdwState *)fmstate, true);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -3680,14 +4099,14 @@ execute_foreign_modify(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3695,10 +4114,10 @@ execute_foreign_modify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -3734,7 +4153,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3744,12 +4163,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3757,9 +4176,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3888,16 +4307,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -4056,9 +4475,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -4066,10 +4485,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -5560,6 +5979,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add the wait event that the ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* Reusing an existing wait event set (reinit == false) is not supported for now. */
+ Assert(reinit);
+
+ if (fsstate->s.commonstate->leader == node)
+ {
+ AddWaitEventToSet(wes,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, caller_data);
+ return true;
+ }
+
+ return false;
+}
+
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation, for use while EXPLAINing ForeignScan. It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..359208a12a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1780,25 +1780,25 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1840,12 +1840,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1904,8 +1904,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
-- Test that UPDATE/DELETE with inherited target works with row-level triggers
CREATE TRIGGER trig_row_before
--
2.18.2
On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
Removed a useless variable PgFdwScanState.result_ready.
Removed duplicate code from remove_async_node() by using move_to_next_waiter().
Done some minor cleanups.
I am reviewing your code.
A couple of variables are no longer needed (see changes.patch in the attachment).
A note about the cost of an asynchronous plan.
In the simple query plan below I see:
1. The startup cost of the local SeqScan is 0 and that of each ForeignScan
is 100, but the startup cost of the Append is 0.
2. The total cost of an Append node is the sum of its subplans' costs. Maybe
in the asynchronous Append case we need to apply some reduction factor?
explain select * from parts;
With Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
Async subplans: 3
-> Async Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
-> Async Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
-> Async Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
-> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
Without Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
-> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
-> Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
-> Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
-> Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
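To make the "reduce factor" idea above concrete, here is a hypothetical sketch. It is not from the patch: ASYNC_REDUCE_FACTOR and append_total_cost() are made-up placeholders, only illustrating how an async-aware total cost might discount the overlapping async subplans.

```c
#include <assert.h>

/* Hypothetical: async subplans overlap in wall-clock time, so charging
 * the Append the full sum of their total costs may over-estimate.
 * ASYNC_REDUCE_FACTOR is a made-up placeholder, not from the patch. */
#define ASYNC_REDUCE_FACTOR 0.5

static double
append_total_cost(const double *sync_costs, int nsync,
                  const double *async_costs, int nasync)
{
    double total = 0.0;

    /* Sync subplans still run one after another, so their costs add up. */
    for (int i = 0; i < nsync; i++)
        total += sync_costs[i];

    /* Async subplans run concurrently; discount them by the factor. */
    for (int i = 0; i < nasync; i++)
        total += async_costs[i] * ASYNC_REDUCE_FACTOR;

    return total;
}
```

Using the subplan totals from the plan above (one SeqScan at 1443.00 and three Foreign Scans at 177.80 each), this would charge roughly 1709.7 instead of the plain sum 1976.4; the exact discount is of course an open question.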
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Attachments:
changes.patch (text/x-patch; charset=UTF-8)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index b04b6a0e54..4406a9c3b3 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1632,9 +1632,6 @@ remove_async_node(ForeignScanState *node)
if (leader == node)
{
- /* It's the leader */
- ForeignScanState *next_leader;
-
if (leader_state->s.commonstate->busy)
{
/*
@@ -1769,7 +1766,7 @@ postgresIterateForeignScan(ForeignScanState *node)
node->ss.ps.asyncstate = AS_AVAILABLE;
else
node->ss.ps.asyncstate = AS_WAITING;
-
+elog(WARNING, "No tuple result %d", fsstate->cursor_number);
return ExecClearTuple(slot);
}
@@ -3703,7 +3700,7 @@ request_more_data(ForeignScanState *node)
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
-
+elog(WARNING, "FETCH: %s", sql);
if (!PQsendQuery(conn, sql))
pgfdw_report_error(ERROR, NULL, conn, false, sql);
@@ -3769,7 +3766,6 @@ fetch_received_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->s.conn;
- char sql[64];
int addrows;
size_t newsize;
int i;
@@ -3798,7 +3794,7 @@ fetch_received_data(ForeignScanState *node)
node,
fsstate->temp_cxt);
}
-
+elog(WARNING, "fetch cursor: %d (%d %d)", fsstate->cursor_number, fsstate->num_tuples, addrows);
/* Update fetch_ct_2 */
if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
Hello, Andrey.
At Tue, 9 Jun 2020 14:20:42 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in
On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
Removed a useless variable PgFdwScanState.result_ready.
Removed duplicate code from remove_async_node() by using
move_to_next_waiter().
Done some minor cleanups.

I am reviewing your code.
A couple of variables are no longer needed (see changes.patch in the
attachment).
Thanks! The recent changes made them useless. Fixed.
Something about the cost of an asynchronous plan:
At the simple query plan (see below) I see:
1. Startup cost of local SeqScan is equal 0, ForeignScan - 100. But
startup cost of Append is 0.
The result itself is right: the Append doesn't wait for the foreign
scans on the first iteration and instead fetches a tuple from the local
table. But the estimation is right only by accident. If you defined a
foreign table as the first partition, the startup cost of the Append
would be 100, which is rather wrong.
2. Total cost of an Append node is a sum of the subplans. Maybe in the
case of asynchronous append we need to use some reduce factor?
Yes. For the reason mentioned above, foreign subpaths don't affect
the startup cost of the Append as long as any sync subpaths exist. If no
sync subpaths exist, the Append's startup cost is the minimum startup
cost among the async subpaths.
I fixed cost_append so that it calculates the correct startup
cost. Now the function estimates as follows.
Append (Foreign(100), Foreign(100), Local(0)) => 0;
Append (Local(0), Foreign(100), Foreign(100)) => 0;
Append (Foreign(100), Foreign(100)) => 100;
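The rule above can be sketched as follows. This is illustrative only, not the actual cost_append() code; SubpathCost and append_startup_cost() are names invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch of the revised startup-cost rule: async subpaths
 * are started in the background, so they do not contribute to the
 * Append's startup cost as long as at least one sync subpath exists;
 * with async subpaths only, the cheapest async startup cost wins. */
typedef struct SubpathCost
{
    double startup_cost;
    bool   is_async;
} SubpathCost;

static double
append_startup_cost(const SubpathCost *subpaths, int n)
{
    double min_async = -1.0;

    for (int i = 0; i < n; i++)
    {
        /* The first sync subpath determines the startup cost. */
        if (!subpaths[i].is_async)
            return subpaths[i].startup_cost;

        /* Track the cheapest async startup cost as a fallback. */
        if (min_async < 0.0 || subpaths[i].startup_cost < min_async)
            min_async = subpaths[i].startup_cost;
    }

    /* No sync subpaths at all: wait for the cheapest async one. */
    return min_async;
}
```

This reproduces the three example estimates above regardless of whether the local subpath comes first or last.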
explain select * from parts;
With Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
  Async subplans: 3
  -> Async Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
  -> Async Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
  -> Async Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
  -> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
The SeqScan seems to be the first partition of the parent, so it is the
first subnode considered at cost estimation time. The result is right,
but it comes from wrong logic.
Without Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
  -> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
  -> Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
  -> Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
  -> Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
The startup cost of the Append is the cost of the first subnode, that is, 0.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v4-0001-Allow-wait-event-set-to-be-registered-to-resource.patch (text/x-patch; charset=us-ascii)
From 281ed344f13352dd00bd08926fd8dd12ae182d01 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v4 1/3] Allow wait event set to be registered to resource
owner
In certain cases a WaitEventSet needs to be released via a resource
owner. This change adds a resowner field to WaitEventSet and allows the
creator of a WaitEventSet to specify a resource owner.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 18 ++++-
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 67 +++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 +++
6 files changed, 96 insertions(+), 5 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7717bb2719..16aefb03ee 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 91fa4b619b..10d71b46cb 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -56,6 +56,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -84,6 +85,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -393,7 +396,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -560,12 +563,15 @@ ResetLatch(Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -680,6 +686,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -725,6 +736,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 37b6a4eecd..fcc92138fe 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -70,7 +70,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
{
WaitEventSet *new_event_set;
- new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+ new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
MyLatch, NULL);
AddWaitEventToSet(new_event_set, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..237ca9fa30 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
ResourceArray jitarr; /* JIT contexts */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
jit_release_context(context);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -725,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
Assert(owner->jitarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -752,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
ResourceArrayFree(&(owner->jitarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1370,3 +1385,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
elog(ERROR, "JIT context %p is not owned by resource owner %s",
DatumGetPointer(handle), owner->name);
}
+
+/*
+ * Make sure there is room for one more entry in a ResourceOwner's wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 46ae56cae3..b1b8375768 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
extern void SetLatch(Latch *latch);
extern void ResetLatch(Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
extern void ResourceOwnerForgetJIT(ResourceOwner owner,
Datum handle);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.18.2
v4-0002-infrastructure-for-asynchronous-execution.patch (text/x-patch; charset=us-ascii)
From 4a172f8009b881ac17ebabcf5f12b0fccca004ba Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v4 2/3] infrastructure for asynchronous execution
This patch adds infrastructure for asynchronous execution. As a PoC,
it makes only Append capable of handling asynchronously executable
subnodes.
---
src/backend/commands/explain.c | 17 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execAsync.c | 152 +++++++++++
src/backend/executor/nodeAppend.c | 342 ++++++++++++++++++++----
src/backend/executor/nodeForeignscan.c | 21 ++
src/backend/nodes/bitmapset.c | 72 +++++
src/backend/nodes/copyfuncs.c | 3 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 3 +
src/backend/optimizer/path/allpaths.c | 24 ++
src/backend/optimizer/path/costsize.c | 40 ++-
src/backend/optimizer/plan/createplan.c | 41 ++-
src/backend/postmaster/pgstat.c | 3 +
src/backend/postmaster/syslogger.c | 2 +-
src/backend/utils/adt/ruleutils.c | 8 +-
src/backend/utils/resowner/resowner.c | 4 +-
src/include/executor/execAsync.h | 22 ++
src/include/executor/executor.h | 1 +
src/include/executor/nodeForeignscan.h | 3 +
src/include/foreign/fdwapi.h | 11 +
src/include/nodes/bitmapset.h | 1 +
src/include/nodes/execnodes.h | 23 +-
src/include/nodes/plannodes.h | 9 +
src/include/optimizer/paths.h | 2 +
src/include/pgstat.h | 3 +-
25 files changed, 740 insertions(+), 71 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9092b4b309..d35de920c8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -86,6 +86,7 @@ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
ExplainState *es);
static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1389,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1969,6 +1972,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Hash:
show_hash_info(castNode(HashState, planstate), es);
break;
+
+ case T_Append:
+ show_append_info(castNode(AppendState, planstate), es);
+ break;
+
default:
break;
}
@@ -2322,6 +2330,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ancestors, es);
}
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+ Append *plan = (Append *) astate->ps.plan;
+
+ if (plan->nasyncplans > 0)
+ ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
/*
* Show the grouping keys for an Agg node.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller didn't reuse an existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit)
+{
+ switch (nodeTag(node))
+ {
+ case T_ForeignScanState:
+ return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+ wes, data, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(node));
+ }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+ int **p_refind;
+ int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+ /* arg is the address of the variable refind in ExecAsyncEventWait */
+ ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+ *mcbarg->p_refind = NULL;
+ *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
+ WaitEventSet *wes;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred = 0;
+ Bitmapset *fired_events = NULL;
+ int i;
+ int n;
+
+ n = bms_num_members(waitnodes);
+ wes = CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner, n);
+ if (refindsize < n)
+ {
+ if (refindsize == 0)
+ refindsize = EVENT_BUFFER_SIZE; /* XXX */
+ while (refindsize < n)
+ refindsize *= 2;
+ if (refind)
+ refind = (int *) repalloc(refind, refindsize * sizeof(int));
+ else
+ {
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+ MemoryContext oldctxt =
+ MemoryContextSwitchTo(TopTransactionContext);
+
+ /*
+ * refind points to a memory block in
+ * TopTransactionContext. Register a callback to reset it.
+ */
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+ refind = (int *) palloc(refindsize * sizeof(int));
+ MemoryContextSwitchTo(oldctxt);
+ }
+ }
+
+ /* Prepare WaitEventSet for waiting on the waitnodes. */
+ n = 0;
+ for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+ i = bms_next_member(waitnodes, i))
+ {
+ refind[i] = i;
+ if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+ n++;
+ }
+
+ /* Return immediately if no node to wait. */
+ if (n == 0)
+ {
+ FreeWaitEventSet(wes);
+ return NULL;
+ }
+
+ noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+ EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+ FreeWaitEventSet(wes);
+ if (noccurred == 0)
+ return NULL;
+
+ for (i = 0 ; i < noccurred ; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+ {
+ int n = *(int*)w->user_data;
+
+ fired_events = bms_add_member(fired_events, n);
+ }
+ }
+
+ return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..60c36ee048 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
#include "miscadmin.h"
/* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
#define INVALID_SUBPLAN_INDEX -1
static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
PlanState **appendplanstates;
Bitmapset *validsubplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
/* check for unsupported flags */
- Assert(!(eflags & EXEC_FLAG_MARK));
+ Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
/*
* create new AppendState for our append node
*/
appendstate->ps.plan = (Plan *) node;
appendstate->ps.state = estate;
- appendstate->ps.ExecProcNode = ExecAppend;
/* Let choose_next_subplan_* function handle setting the first subplan */
- appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/*
* When no run-time pruning is required and there's at least one
- * subplan, we can fill as_valid_subplans immediately, preventing
+ * subplan, we can fill as_valid_syncsubplans immediately, preventing
* later calls to ExecFindMatchingSubPlans.
*/
if (!prunestate->do_exec_prune && nplans > 0)
- appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
}
else
{
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* subplans as valid; they must also all be initialized.
*/
Assert(nplans > 0);
- appendstate->as_valid_subplans = validsubplans =
- bms_add_range(NULL, 0, nplans - 1);
+ validsubplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
appendstate->as_prune_state = NULL;
}
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
*/
j = 0;
firstvalid = nplans;
+ nasyncplans = 0;
+
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ int sub_eflags = eflags;
+
+ /* Let async-capable subplans run asynchronously */
+ if (i < node->nasyncplans)
+ {
+ sub_eflags |= EXEC_FLAG_ASYNC;
+ nasyncplans++;
+ }
/*
* Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
if (i >= node->first_partial_plan && j < firstvalid)
firstvalid = j;
- appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+ appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
}
appendstate->as_first_partial_plan = firstvalid;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* fill in async stuff */
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_syncdone = (nasyncplans == nplans);
+ appendstate->as_exec_prune = false;
+
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+ if (appendstate->as_nasyncplans)
+ {
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async requests need a request */
+ appendstate->as_needrequest =
+ bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+ /*
+ * If run-time pruning applies, let ExecAppendAsync compute
+ * as_valid_syncsubplans itself, excluding the async subnodes.
+ */
+ if (appendstate->as_prune_state != NULL &&
+ appendstate->as_prune_state->do_exec_prune)
+ {
+ Assert(appendstate->as_valid_syncsubplans == NULL);
+
+ appendstate->as_exec_prune = true;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
- if (node->as_whichplan < 0)
+ if (node->as_whichsyncplan < 0)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
* If no subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+ if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
!node->choose_next_subplan(node))
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
+ Assert(node->as_nasyncplans == 0);
+
for (;;)
{
PlanState *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
/*
* figure out which subplan we are currently processing
*/
- Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
- subnode = node->appendplans[node->as_whichplan];
+ Assert(node->as_whichsyncplan >= 0 &&
+ node->as_whichsyncplan < node->as_nplans);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
}
}
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+ AppendState *node = castNode(AppendState, pstate);
+ Bitmapset *needrequest;
+ int i;
+
+ Assert(node->as_nasyncplans > 0);
+
+restart:
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (node->as_exec_prune)
+ {
+ Bitmapset *valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ /* Distribute valid subplans into sync and async */
+ node->as_needrequest =
+ bms_intersect(node->as_needrequest, valid_subplans);
+ node->as_valid_syncsubplans =
+ bms_difference(valid_subplans, node->as_needrequest);
+
+ node->as_exec_prune = false;
+ }
+
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ while ((i = bms_first_member(needrequest)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
+ {
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ }
+ }
+ else
+ node->as_pending_async = bms_add_member(node->as_pending_async, i);
+ }
+ bms_free(needrequest);
+
+ for (;;)
+ {
+ TupleTableSlot *result;
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ while (!bms_is_empty(node->as_pending_async))
+ {
+ /* Don't block waiting for async nodes while sync nodes remain. */
+ long timeout = node->as_syncdone ? -1 : 0;
+ Bitmapset *fired;
+ int i;
+
+ fired = ExecAsyncEventWait(node->appendplans,
+ node->as_pending_async,
+ timeout);
+
+ if (bms_is_empty(fired) && node->as_syncdone)
+ {
+ /*
+ * We get here if all the pending subnodes fired before we
+ * actually waited. Retry fetching from those nodes.
+ */
+ node->as_needrequest = node->as_pending_async;
+ node->as_pending_async = NULL;
+ goto restart;
+ }
+
+ while ((i = bms_first_member(fired)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+ slot = ExecProcNode(subnode);
+
+ Assert(subnode->asyncstate == AS_AVAILABLE);
+
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, i);
+ }
+
+ node->as_pending_async =
+ bms_del_member(node->as_pending_async, i);
+ }
+ bms_free(fired);
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done scanning
+ * this node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the synchronous children.
+ */
+
+ if (!node->as_syncdone &&
+ node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+ node->as_syncdone = !node->choose_next_subplan(node);
+
+ if (node->as_syncdone)
+ {
+ Assert(bms_is_empty(node->as_pending_async));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
+ /*
+ * get a tuple from the subplan
+ */
+ result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+ if (!TupIsNull(result))
+ {
+ /*
+ * If the subplan gave us something then return it as-is. We do
+ * NOT make use of the result slot that was set up in
+ * ExecInitAppend; there's no need for it.
+ */
+ return result;
+ }
+
+ /*
+ * Go on to the "next" subplan. If no more subplans, return the empty
+ * slot set up for us by ExecInitAppend, unless there are async plans
+ * we have yet to finish.
+ */
+ if (!node->choose_next_subplan(node))
+ {
+ node->as_syncdone = true;
+ if (bms_is_empty(node->as_pending_async))
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /* Else loop back and try to get a tuple from the new subplan */
+ }
+}
+
/* ----------------------------------------------------------------
* ExecEndAppend
*
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
bms_overlap(node->ps.chgParam,
node->as_prune_state->execparamids))
{
- bms_free(node->as_valid_subplans);
- node->as_valid_subplans = NULL;
+ bms_free(node->as_valid_syncsubplans);
+ node->as_valid_syncsubplans = NULL;
}
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ ExecShutdownNode(node->appendplans[i]);
+
+ node->as_nasyncresult = 0;
+ node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
}
/* Let choose_next_subplan_* function handle setting the first subplan */
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
}
/* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
static bool
choose_next_subplan_locally(AppendState *node)
{
- int whichplan = node->as_whichplan;
+ int whichplan = node->as_whichsyncplan;
int nextplan;
/* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
{
- if (node->as_valid_subplans == NULL)
- node->as_valid_subplans =
+ /* Shouldn't have an active async node */
+ Assert(bms_is_empty(node->as_needrequest));
+
+ if (node->as_valid_syncsubplans == NULL)
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
+ /* Exclude async plans */
+ if (node->as_nasyncplans > 0)
+ bms_del_range(node->as_valid_syncsubplans,
+ 0, node->as_nasyncplans - 1);
+
whichplan = -1;
}
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
Assert(whichplan >= -1 && whichplan <= node->as_nplans);
if (ScanDirectionIsForward(node->ps.state->es_direction))
- nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
else
- nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
if (nextplan < 0)
return false;
- node->as_whichplan = nextplan;
+ node->as_whichsyncplan = nextplan;
return true;
}
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
{
/* Mark just-completed subplan as finished. */
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
}
else
{
/* Start with last subplan. */
- node->as_whichplan = node->as_nplans - 1;
+ node->as_whichsyncplan = node->as_nplans - 1;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be
* set to all subplans.
*/
- if (node->as_valid_subplans == NULL)
+ if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
/*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
}
/* Loop until we find a subplan to execute. */
- while (pstate->pa_finished[node->as_whichplan])
+ while (pstate->pa_finished[node->as_whichsyncplan])
{
- if (node->as_whichplan == 0)
+ if (node->as_whichsyncplan == 0)
{
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
LWLockRelease(&pstate->pa_lock);
return false;
}
/*
- * We needn't pay attention to as_valid_subplans here as all invalid
+ * We needn't pay attention to as_valid_syncsubplans here as all invalid
* plans have been marked as finished.
*/
- node->as_whichplan--;
+ node->as_whichsyncplan--;
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
/* Mark just-completed subplan as finished. */
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be set
* to all subplans.
*/
- else if (node->as_valid_subplans == NULL)
+ else if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
mark_invalid_subplans_as_finished(node);
}
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Save the plan from which we are starting the search. */
- node->as_whichplan = pstate->pa_next_plan;
+ node->as_whichsyncplan = pstate->pa_next_plan;
/* Loop until we find a valid subplan to execute. */
while (pstate->pa_finished[pstate->pa_next_plan])
{
int nextplan;
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
if (nextplan >= 0)
{
/* Advance to the next valid plan. */
pstate->pa_next_plan = nextplan;
}
- else if (node->as_whichplan > node->as_first_partial_plan)
+ else if (node->as_whichsyncplan > node->as_first_partial_plan)
{
/*
* Try looping back to the first valid partial plan, if there is
* one. If there isn't, arrange to bail out below.
*/
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
pstate->pa_next_plan =
- nextplan < 0 ? node->as_whichplan : nextplan;
+ nextplan < 0 ? node->as_whichsyncplan : nextplan;
}
else
{
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
* At last plan, and either there are no partial plans or we've
* tried them all. Arrange to bail out.
*/
- pstate->pa_next_plan = node->as_whichplan;
+ pstate->pa_next_plan = node->as_whichsyncplan;
}
- if (pstate->pa_next_plan == node->as_whichplan)
+ if (pstate->pa_next_plan == node->as_whichsyncplan)
{
/* We've tried everything! */
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Pick the plan we found, and advance pa_next_plan one more time. */
- node->as_whichplan = pstate->pa_next_plan;
- pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+ node->as_whichsyncplan = pstate->pa_next_plan;
+ pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
/*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
*/
if (pstate->pa_next_plan < 0)
{
- int nextplan = bms_next_member(node->as_valid_subplans,
+ int nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
Assert(node->as_prune_state);
/* Nothing to do if all plans are valid */
- if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+ if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
return;
/* Mark all non-valid plans as finished */
for (i = 0; i < node->as_nplans; i++)
{
- if (!bms_is_member(i, node->as_valid_subplans))
+ if (!bms_is_member(i, node->as_valid_syncsubplans))
node->as_pstate->pa_finished[i] = true;
}
}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+ scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+ if ((eflags & EXEC_FLAG_ASYNC) != 0)
+ scanstate->fs_async = true;
/*
* Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecForeignAsyncConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+ caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
return a;
}
+/*
+ * bms_del_range
+ * Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+ int lwordnum,
+ lbitnum,
+ uwordnum,
+ ushiftbits,
+ wordnum;
+
+ if (lower < 0 || upper < 0)
+ elog(ERROR, "negative bitmapset member not allowed");
+ if (lower > upper)
+ elog(ERROR, "lower range must not be above upper range");
+ uwordnum = WORDNUM(upper);
+
+ if (a == NULL)
+ {
+ a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ }
+
+ /* ensure we have enough words to store the upper bit */
+ else if (uwordnum >= a->nwords)
+ {
+ int oldnwords = a->nwords;
+ int i;
+
+ a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ /* zero out the enlarged portion */
+ for (i = oldnwords; i < a->nwords; i++)
+ a->words[i] = 0;
+ }
+
+ wordnum = lwordnum = WORDNUM(lower);
+
+ lbitnum = BITNUM(lower);
+ ushiftbits = BITNUM(upper) + 1;
+
+ /*
+ * In the special case where lwordnum is the same as uwordnum, we must
+ * perform both the upper and lower masking on that single word.
+ */
+ if (lwordnum == uwordnum)
+ {
+ a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+ | (~(bitmapword) 0) << ushiftbits);
+ }
+ else
+ {
+ /* turn off lbitnum and all bits left of it */
+ a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+ /* turn off all bits for any intermediate words */
+ while (wordnum < uwordnum)
+ a->words[wordnum++] = (bitmapword) 0;
+
+ /* turn off upper's bit and all bits right of it. */
+ a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+ }
+
+ return a;
+}
+
/*
* bms_int_members - like bms_intersect, but left input is recycled
*/
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8cf87e6d0..89a49e2fdc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..d4bb44b268 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..63af7c02d8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1572,6 +1572,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1672,6 +1673,8 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index d984da25d7..bb4c8723bc 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3937,6 +3937,30 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
list_free(live_children);
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
+
/*****************************************************************************
* DEBUG SUPPORT
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b976afb69d..c4c83b5887 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2050,13 +2050,9 @@ cost_append(AppendPath *apath)
if (pathkeys == NIL)
{
- Path *subpath = (Path *) linitial(apath->subpaths);
+ Cost async_min_startup_cost = -1.0;
- /*
- * For an unordered, non-parallel-aware Append we take the startup
- * cost as the startup cost of the first subpath.
- */
- apath->path.startup_cost = subpath->startup_cost;
+ apath->path.startup_cost = -1.0;
/* Compute rows and costs as sums of subplan rows and costs. */
foreach(l, apath->subpaths)
@@ -2065,6 +2061,38 @@ cost_append(AppendPath *apath)
apath->path.rows += subpath->rows;
apath->path.total_cost += subpath->total_cost;
+
+ if (!is_async_capable_path(subpath))
+ {
+ /*
+ * For an unordered, non-parallel-aware Append we take the
+ * startup cost as the startup cost of the first subpath.
+ */
+ if (apath->path.startup_cost < 0)
+ apath->path.startup_cost = subpath->startup_cost;
+ }
+ else if (apath->path.startup_cost < 0)
+ {
+ /*
+ * Async-capable paths don't usually affect the startup cost
+ * of the Append. However, if the Append contains no sync
+ * paths, its startup cost is the minimum async startup cost.
+ * Track that minimum in case it is needed.
+ */
+ if (async_min_startup_cost < 0 ||
+ async_min_startup_cost > subpath->startup_cost)
+ async_min_startup_cost = subpath->startup_cost;
+ }
+ }
+
+ /*
+ * Use the minimum async startup cost if no sync startup has been
+ * found.
+ */
+ if (apath->path.startup_cost < 0)
+ {
+ Assert(async_min_startup_cost >= 0);
+ apath->path.startup_cost = async_min_startup_cost;
}
}
else
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index eb9543f6ad..7f75708134 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1082,6 +1082,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
bool tlist_was_changed = false;
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
+ List *asyncpaths = NIL;
+ List *syncpaths = NIL;
+ List *newsubpaths = NIL;
ListCell *subpaths;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
@@ -1090,6 +1095,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1219,9 +1227,36 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
}
- subplans = lappend(subplans, subplan);
+ /*
+ * Classify as async-capable or not. If we have decided to run the
+ * children in parallel, we cannot run any one of them asynchronously.
+ */
+ if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ asyncplans = lappend(asyncplans, subplan);
+ asyncpaths = lappend(asyncpaths, subpath);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ {
+ syncplans = lappend(syncplans, subplan);
+ syncpaths = lappend(syncpaths, subpath);
+ }
+
+ first = false;
}
+ /*
+ * subplans contains the async plans first, if any, followed by the
+ * sync plans, if any. Order subpaths the same way so that the
+ * partition pruning information stays in sync with subplans.
+ */
+ subplans = list_concat(asyncplans, syncplans);
+ newsubpaths = list_concat(asyncpaths, syncpaths);
+
/*
* If any quals exist, they may be useful to perform further partition
* pruning during execution. Gather information needed by the executor to
@@ -1249,7 +1284,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
if (prunequal != NIL)
partpruneinfo =
make_partition_pruneinfo(root, rel,
- best_path->subpaths,
+ newsubpaths,
best_path->partitioned_rels,
prunequal);
}
@@ -1257,6 +1292,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
plan->appendplans = subplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
+ plan->nasyncplans = nasyncplans;
+ plan->referent = referent_is_sync ? nasyncplans : 0;
copy_generic_path_info(&plan->plan, (Path *) best_path);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 309378ae54..d9ea75d823 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3882,6 +3882,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index ffcb54968f..a4de6d90e2 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -300,7 +300,7 @@ SysLoggerMain(int argc, char *argv[])
* syslog pipe, which implies that all other backends have exited
* (including the postmaster).
*/
- wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+ wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
#ifndef WIN32
AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 076c3c019f..f7b5587d7f 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4584,10 +4584,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
* tlists according to one of the children, and the first one is the most
* natural choice. Likewise special-case ModifyTable to pretend that the
* first child plan is the OUTER referent; this is to support RETURNING
- * lists containing references to non-target relations.
+ * lists containing references to non-target relations. For Append, use the
+ * explicitly specified referent.
*/
if (IsA(plan, Append))
- dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+ {
+ Append *app = (Append *) plan;
+ dpns->outer_plan = list_nth(app->appendplans, app->referent);
+ }
else if (IsA(plan, MergeAppend))
dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
else if (IsA(plan, ModifyTable))
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 237ca9fa30..27742a1641 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1416,7 +1416,7 @@ void
ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
{
/*
- * XXXX: There's no property to show as an identier of a wait event set,
+ * XXXX: There's no property to show as an identifier of a wait event set,
* use its pointer instead.
*/
if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
@@ -1431,7 +1431,7 @@ static void
PrintWESLeakWarning(WaitEventSet *events)
{
/*
- * XXXX: There's no property to show as an identier of a wait event set,
+ * XXXX: There's no property to show as an identifier of a wait event set,
* use its pointer instead.
*/
elog(WARNING, "wait event set leak: %p still referenced",
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+ long timeout);
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c7deeac662..aca9e2bddd 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
#define EXEC_FLAG_MARK 0x0008 /* need mark/restore */
#define EXEC_FLAG_SKIP_TRIGGERS 0x0010 /* skip AfterTrigger calls */
#define EXEC_FLAG_WITH_NO_DATA 0x0020 /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC 0x0040 /* request async execution */
/* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data,
+ bool reinit);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
GetForeignPlan_function GetForeignPlan;
BeginForeignScan_function BeginForeignScan;
IterateForeignScan_function IterateForeignScan;
+ IterateForeignScan_function IterateForeignScanAsync;
ReScanForeignScan_function ReScanForeignScan;
EndForeignScan_function EndForeignScan;
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
InitializeDSMForeignScan_function InitializeDSMForeignScan;
ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
ShutdownForeignScan_function ShutdownForeignScan;
/* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
/* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 98e0072b8a..cd50494c74 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -938,6 +938,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
* abstract superclass for all PlanState-type nodes.
* ----------------
*/
+typedef enum AsyncState
+{
+ AS_AVAILABLE,
+ AS_WAITING
+} AsyncState;
+
typedef struct PlanState
{
NodeTag type;
@@ -1026,6 +1032,11 @@ typedef struct PlanState
bool outeropsset;
bool inneropsset;
bool resultopsset;
+
+ /* Async subnode execution stuff */
+ AsyncState asyncstate;
+
+ int32 padding; /* to keep alignment of derived types */
} PlanState;
/* ----------------
@@ -1221,14 +1232,21 @@ struct AppendState
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
- int as_whichplan;
+ int as_whichsyncplan; /* which sync plan is being executed */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
+ int as_nasyncplans; /* # of async-capable children */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
- Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_syncsubplans;
bool (*choose_next_subplan) (AppendState *);
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ Bitmapset *as_pending_async; /* pending async plans */
+ TupleTableSlot **as_asyncresult; /* results of each async plan */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ bool as_exec_prune; /* runtime pruning needed for async exec? */
};
/* ----------------
@@ -1796,6 +1814,7 @@ typedef struct ForeignScanState
Size pscan_len; /* size of parallel coordination information */
/* use struct pointer to avoid including fdwapi.h here */
struct FdwRoutine *fdwroutine;
+ bool fs_async;
void *fdw_state; /* foreign-data wrapper can keep state here */
} ForeignScanState;
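The `AsyncState` enum and `fs_async` flag added above define a small handshake: a node that returns no tuple must also say whether it is exhausted (`AS_AVAILABLE`) or still waiting for data (`AS_WAITING`). As a hypothetical standalone model (the `ToyNode`, `toy_exec`, and `toy_append` names are ours, not from the patch), the caller's polling loop might look like this:

```c
#include <assert.h>

/* Standalone model of the AsyncState handshake: a node that returns no
 * tuple reports via asyncstate whether it is done or merely waiting. */
typedef enum AsyncState { AS_AVAILABLE, AS_WAITING } AsyncState;

typedef struct ToyNode
{
    AsyncState asyncstate;
    int        remaining;   /* tuples this node will eventually produce */
    int        delay;       /* "network" ticks before the next tuple */
} ToyNode;

/* Returns 1 and sets *tuple when a tuple is ready; returns 0 otherwise,
 * leaving asyncstate to tell the caller why. */
static int
toy_exec(ToyNode *n, int *tuple)
{
    if (n->remaining == 0) { n->asyncstate = AS_AVAILABLE; return 0; }
    if (n->delay > 0)      { n->delay--; n->asyncstate = AS_WAITING; return 0; }
    n->remaining--;
    n->delay = 1;
    n->asyncstate = AS_AVAILABLE;
    *tuple = 1;
    return 1;
}

/* Append-style driver: keep polling subnodes; a node in AS_WAITING keeps
 * the loop alive, and the loop ends only when every node is exhausted. */
static int
toy_append(ToyNode *nodes, int n)
{
    int collected = 0, progress = 1;

    while (progress)
    {
        progress = 0;
        for (int i = 0; i < n; i++)
        {
            int tup;

            if (toy_exec(&nodes[i], &tup)) { collected++; progress = 1; }
            else if (nodes[i].asyncstate == AS_WAITING) progress = 1;
        }
    }
    return collected;
}
```

The real ExecAppend would block on a WaitEventSet instead of busy-polling; the model only illustrates why an empty result must be paired with an async state.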
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..abad89b327 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous execution logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -262,6 +267,10 @@ typedef struct Append
/* Info for run-time subplan pruning; NULL if we're not doing that */
struct PartitionPruneInfo *part_prune_info;
+
+ /* Async child node execution stuff */
+ int nasyncplans; /* # async subplans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 10b6e81079..53876b2d8b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -241,4 +241,6 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+extern bool is_async_capable_path(Path *path);
+
#endif /* PATHS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..2259910637 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,7 +887,8 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
- WAIT_EVENT_XACT_GROUP_UPDATE
+ WAIT_EVENT_XACT_GROUP_UPDATE,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.18.2
Attachment: v4-0003-async-postgres_fdw.patch (text/x-patch; charset=us-ascii)
From 5c64d2d3315d7e38676f349c10f94f445da2da58 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v4 3/3] async postgres_fdw
---
contrib/postgres_fdw/connection.c | 28 +
.../postgres_fdw/expected/postgres_fdw.out | 222 ++++---
contrib/postgres_fdw/postgres_fdw.c | 601 +++++++++++++++---
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +-
5 files changed, 691 insertions(+), 182 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 52d1fe3563..d9edc5e4de 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
bool invalidated; /* true if reconnect is pending */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
entry->conn, server->servername, user->umid, user->userid);
+ entry->storage = NULL;
}
/*
@@ -215,6 +217,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
return entry->conn;
}
+/*
+ * Returns the connection-specific storage for this user. Allocates it,
+ * zero-filled, with initsize bytes if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ bool found;
+ ConnCacheEntry *entry;
+ ConnCacheKey key;
+
+	/* Find storage using the same key as GetConnection */
+ key = user->umid;
+ entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+ Assert(found);
+
+	/* Create one if it doesn't exist yet. */
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
/*
* Connect to remote server using specified server and user mapping properties.
*/
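`GetConnectionSpecificStorage` above has an allocate-once contract: one zeroed blob per user mapping, created lazily and returned unchanged on later calls. A hypothetical standalone model of that contract (a flat array stands in for `ConnectionHash`, and `get_conn_storage` is our name, not the patch's):

```c
#include <assert.h>
#include <stdlib.h>

/* Model of the allocate-once, connection-specific storage contract:
 * one zero-filled blob per key (user mapping OID), created lazily. */
#define MAX_ENTRIES 16

typedef struct CacheEntry
{
    unsigned    key;        /* models the user mapping OID */
    void       *storage;
} CacheEntry;

static CacheEntry cache[MAX_ENTRIES];
static int  ncache;

static void *
get_conn_storage(unsigned umid, size_t initsize)
{
    for (int i = 0; i < ncache; i++)
        if (cache[i].key == umid)
            return cache[i].storage;

    /* Not found: create a zero-filled blob, as the patch does with
     * MemoryContextAlloc + memset in CacheMemoryContext. */
    cache[ncache].key = umid;
    cache[ncache].storage = calloc(1, initsize);
    return cache[ncache++].storage;
}
```

Because the blob survives across scans on the same connection, postgres_fdw can use it below to share per-connection state (the leader/busy flags) among all nodes on that connection.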
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 82fc1290ef..29aa09db8e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6973,7 +6973,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7001,7 +7001,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7029,7 +7029,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7057,7 +7057,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7127,35 +7127,41 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7165,35 +7171,41 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7223,11 +7235,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-> Hash Join
Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
@@ -7241,12 +7254,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
select tableoid::regclass, * from bar order by 1,2;
@@ -7276,16 +7290,17 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
+ Output: ROW(foo_1.f1), foo_1.f1
+ Remote SQL: SELECT f1 FROM public.loct1
+ -> Async Foreign Scan on public.foo2 foo_3
+ Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+ Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
- Output: ROW(foo_1.f1), foo_1.f1
- Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
- Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
- Remote SQL: SELECT f1 FROM public.loct1
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -7303,17 +7318,18 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
+ Output: ROW(foo_1.f1), foo_1.f1
+ Remote SQL: SELECT f1 FROM public.loct1
+ -> Async Foreign Scan on public.foo2 foo_3
+ Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+ Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
- Output: ROW(foo_1.f1), foo_1.f1
- Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
- Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
- Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+(47 rows)
update bar set f2 = f2 + 100
from
@@ -7463,27 +7479,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2 bar_1
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2 bar_1
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2 bar_1
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2 bar_1
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
@@ -8558,11 +8580,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a % 25 =0 ORDER BY 1,2,3;
a | b | c
@@ -8597,20 +8620,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
wr | wr
@@ -8639,11 +8664,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
a | b
@@ -8696,21 +8722,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
-- test FOR UPDATE; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
- QUERY PLAN
---------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------
LockRows
-> Sort
Sort Key: t1.a
-> Hash Join
Hash Cond: (t2.b = t1.a)
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
-> Hash
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
a | b
@@ -8745,18 +8773,19 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+ Async subplans: 3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
-- Plan with partitionwise aggregates is enabled
SET enable_partitionwise_aggregate TO true;
@@ -8767,13 +8796,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ Async subplans: 3
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
a | sum | min | count
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9fc53cad68..4bfc2d39ea 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -35,6 +37,7 @@
#include "optimizer/restrictinfo.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "postgres_fdw.h"
#include "utils/builtins.h"
#include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -122,11 +128,29 @@ enum FdwDirectModifyPrivateIndex
FdwDirectModifyPrivateSetProcessed
};
+/*
+ * Connection common state - shared among all PgFdwState instances using the
+ * same connection.
+ */
+typedef struct PgFdwConnCommonState
+{
+ ForeignScanState *leader; /* leader node of this connection */
+ bool busy; /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnCommonState *commonstate; /* connection common state */
+} PgFdwState;
+
/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +161,6 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool async; /* true if run asynchronously */
+ bool queued; /* true if this node is in waiter queue */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* last element in waiter queue.
+ * valid only on the leader node */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
/*
* Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
PG_RETURN_POINTER(routine);
}
@@ -1434,12 +1476,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+ fsstate->s.commonstate->leader = NULL;
+ fsstate->s.commonstate->busy = false;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->async = false;
+ fsstate->queued = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1487,40 +1539,241 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_values);
}
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue the node if it isn't already in the queue. Send the request
+ * immediately if the underlying connection is not busy.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+ /*
+ * Do nothing if the node is already in the queue or already eof'ed.
+ * Note: leader node is not marked as queued.
+ */
+ if (leader == node || fsstate->queued || fsstate->eof_reached)
+ return;
+
+ if (leader == NULL)
+ {
+ /* no leader means not busy, send request immediately */
+ request_more_data(node);
+ }
+ else
+ {
+ /* the connection is busy, queue the node */
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+ PgFdwScanState *last_waiter_state
+ = GetPgFdwScanState(leader_state->last_waiter);
+
+ last_waiter_state->waiter = node;
+ leader_state->last_waiter = node;
+ fsstate->queued = true;
+ }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Make the first waiter the next leader.
+ * Returns the new leader or NULL if there's no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *leader_state = GetPgFdwScanState(node);
+ ForeignScanState *next_leader = leader_state->waiter;
+
+	Assert(leader_state->s.commonstate->leader == node);
+
+ if (next_leader)
+ {
+ /* the first waiter becomes the next leader */
+ PgFdwScanState *next_leader_state = GetPgFdwScanState(next_leader);
+ next_leader_state->last_waiter = leader_state->last_waiter;
+ next_leader_state->queued = false;
+ }
+
+ leader_state->waiter = NULL;
+ leader_state->s.commonstate->leader = next_leader;
+
+ return next_leader;
+}
+
+/*
+ * Remove the node from waiter queue.
+ *
+ * Remaining results are cleared if the node is a busy leader.
+ * This is intended to be used during node shutdown.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state;
+ ForeignScanState *prev;
+ PgFdwScanState *prev_state;
+ ForeignScanState *cur;
+
+ /* no need to remove me */
+ if (!leader || !fsstate->queued)
+ return;
+
+ leader_state = GetPgFdwScanState(leader);
+
+ if (leader == node)
+ {
+ if (leader_state->s.commonstate->busy)
+ {
+ /*
+			 * This node is waiting for a result; absorb it first so
+			 * that subsequent commands can be sent on the connection.
+ */
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+ PGconn *conn = leader_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+
+ leader_state->s.commonstate->busy = false;
+ }
+
+ move_to_next_waiter(node);
+
+ return;
+ }
+
+ /*
+ * Just remove the node from the queue
+ *
+ * Nodes don't have a link to the previous node, but this function is
+ * only called on the shutdown path, so we don't bother with a faster
+ * way to do this.
+ */
+ prev = leader;
+ prev_state = leader_state;
+ cur = GetPgFdwScanState(prev)->waiter;
+ while (cur)
+ {
+ PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+ if (cur == node)
+ {
+ prev_state->waiter = curstate->waiter;
+
+ /* relink to the previous node if the last node was removed */
+ if (leader_state->last_waiter == cur)
+ leader_state->last_waiter = prev;
+
+ fsstate->queued = false;
+
+ return;
+ }
+ prev = cur;
+ prev_state = curstate;
+ cur = curstate->waiter;
+ }
+}
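The queue manipulated by the three functions above is a singly linked list of waiters hanging off a per-connection leader, with a `last_waiter` pointer kept on the leader for O(1) enqueue. A hypothetical standalone model of just that list discipline (the `Node`, `add_waiter`, and `move_to_next_waiter` names mirror the patch but the bodies are simplified: no libpq, and "becoming leader" stands in for sending a request):

```c
#include <stddef.h>

/* Model of the per-connection waiter queue: one leader plus a singly
 * linked list of waiters; last_waiter is valid only on the leader. */
typedef struct Node
{
    struct Node *waiter;        /* next node in the queue */
    struct Node *last_waiter;   /* tail of the queue; leader only */
    int          queued;
} Node;

static Node *leader;            /* models commonstate->leader */

static void
add_waiter(Node *n)
{
    if (leader == n || n->queued)
        return;                 /* leader is never marked as queued */

    if (leader == NULL)
    {
        /* Connection idle: n becomes the leader immediately. */
        leader = n;
        n->last_waiter = n;
        return;
    }
    leader->last_waiter->waiter = n;
    leader->last_waiter = n;
    n->queued = 1;
}

/* The first waiter becomes the next leader; returns it (or NULL). */
static Node *
move_to_next_waiter(void)
{
    Node *next = leader->waiter;

    if (next)
    {
        next->last_waiter = leader->last_waiter;
        next->queued = 0;
    }
    leader->waiter = NULL;
    leader = next;
    return next;
}
```

In the patch, `move_to_next_waiter` runs when the old leader's results have been drained, so the connection's single in-flight query always belongs to the current leader.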
+
/*
* postgresIterateForeignScan
- * Retrieve next row from the result set, or clear tuple slot to indicate
- * EOF.
+ * Retrieve next row from the result set.
+ *
+ * For synchronous nodes, returning an empty tuple slot means EOF.
+ *
+ * For asynchronous nodes, if an empty tuple slot is returned, the caller
+ * needs to check the async state to tell whether all tuples have been
+ * received (AS_AVAILABLE) or more data is yet to come (AS_WAITING).
*/
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- /*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
- * Get some more tuples, if we've run out.
- */
+ if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+ {
+ /* we've run out, get some more tuples */
+ if (!node->fs_async)
+ {
+ /*
+ * finish the running query before sending the next command for
+ * this node
+ */
+ if (!fsstate->s.commonstate->busy)
+ vacate_connection((PgFdwState *)fsstate, false);
+
+ request_more_data(node);
+
+ /* Fetch the result immediately. */
+ fetch_received_data(node);
+ }
+ else if (!fsstate->s.commonstate->busy)
+ {
+ /* If the connection is not busy, just send the request. */
+ request_more_data(node);
+ }
+ else
+ {
+ /* The connection is busy, queue the request */
+ bool available = true;
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+ /* queue the requested node */
+ add_async_waiter(node);
+
+ /*
+ * The request for the next node cannot be sent before the leader
+ * responds. Finish the current leader if possible.
+ */
+ if (PQisBusy(leader_state->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT |
+ WL_EXIT_ON_PM_DEATH,
+ PQsocket(leader_state->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (!(rc & WL_SOCKET_READABLE))
+ available = false;
+ }
+
+ /* fetch the leader's data and enqueue it for the next request */
+ if (available)
+ {
+ fetch_received_data(leader);
+ add_async_waiter(leader);
+ }
+ }
+ }
+
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
- if (fsstate->next_tuple >= fsstate->num_tuples)
- return ExecClearTuple(slot);
+ /*
+ * We haven't received a result for the given node yet, so return
+ * without a tuple to give way to another node.
+ */
+ if (fsstate->eof_reached)
+ node->ss.ps.asyncstate = AS_AVAILABLE;
+ else
+ node->ss.ps.asyncstate = AS_WAITING;
+
+ return ExecClearTuple(slot);
}
/*
* Return the next tuple.
*/
+ node->ss.ps.asyncstate = AS_AVAILABLE;
ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
false);
@@ -1535,7 +1788,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1543,6 +1796,8 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ vacate_connection((PgFdwState *)fsstate, true);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1571,9 +1826,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1591,7 +1846,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1599,15 +1854,31 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
+/*
+ * postgresShutdownForeignScan
+ * Remove asynchrony state and clean up any leftovers on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* remove the node from waiting queue */
+ remove_async_node(node);
+}
+
/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
@@ -2372,7 +2643,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2457,7 +2730,11 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ /* finish running query to send my command */
+ vacate_connection((PgFdwState *)dmstate, true);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2504,8 +2781,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2703,6 +2980,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnCommonState *commonstate;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2747,6 +3025,18 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ commonstate = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnCommonState));
+ if (commonstate)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.commonstate = commonstate;
+
+ /* finish running query to send my command */
+ vacate_connection(&tmpstate, true);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3317,11 +3607,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -3384,50 +3674,119 @@ create_cursor(ForeignScanState *node)
}
/*
- * Fetch some more rows from the node's cursor.
+ * Send the next request for the node. If the given node is different from the
+ * current connection leader, push the current leader back onto the waiter
+ * queue and make the given node the leader.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* must be non-busy */
+ Assert(!fsstate->s.commonstate->busy);
+ /* must be not-eof'ed */
+ Assert(!fsstate->eof_reached);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.commonstate->busy = true;
+
+ /* The node is the current leader, just return. */
+ if (leader == node)
+ return;
+
+ /* Let the node be the leader */
+ if (leader != NULL)
+ {
+ remove_async_node(node);
+ fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+ fsstate->waiter = leader;
+ }
+ else
+ {
+ fsstate->last_waiter = node;
+ fsstate->waiter = NULL;
+ }
+
+ fsstate->s.commonstate->leader = node;
+}
+
+/*
+ * Fetch the received data, then automatically send the next waiter's request.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ ForeignScanState *waiter;
+
+ /* I should be the current connection leader */
+ Assert(fsstate->s.commonstate->leader == node);
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* Some tuples remain; move them to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
- char sql[64];
- int numrows;
+ PGconn *conn = fsstate->s.conn;
+ int addrows;
+ size_t newsize;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
-
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
+ res = pgfdw_get_result(conn, fsstate->query);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3437,22 +3796,73 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
}
PG_FINALLY();
{
+ fsstate->s.commonstate->busy = false;
+
if (res)
PQclear(res);
}
PG_END_TRY();
+ /* let the first waiter be the next leader of this connection */
+ waiter = move_to_next_waiter(node);
+
+ /* send the next request if any */
+ if (waiter)
+ request_more_data(waiter);
+
MemoryContextSwitchTo(oldcontext);
}
+/*
+ * Vacate the underlying connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+ PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+ ForeignScanState *leader;
+
+ Assert(commonstate != NULL);
+
+ /* just return if the connection is already available */
+ if (commonstate->leader == NULL || !commonstate->busy)
+ return;
+
+ /*
+ * Let the current connection leader read all of the results of the
+ * running query.
+ */
+ leader = commonstate->leader;
+ fetch_received_data(leader);
+
+ /* let the first waiter be the next leader of this connection */
+ move_to_next_waiter(leader);
+
+ if (!clear_queue)
+ return;
+
+ /* Clear the waiting list */
+ while (leader)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+ fsstate->last_waiter = NULL;
+ leader = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
@@ -3566,7 +3976,9 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -3653,6 +4065,9 @@ execute_foreign_modify(EState *estate,
operation == CMD_UPDATE ||
operation == CMD_DELETE);
+ /* finish running query to send my command */
+ vacate_connection((PgFdwState *)fmstate, true);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -3680,14 +4095,14 @@ execute_foreign_modify(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3695,10 +4110,10 @@ execute_foreign_modify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -3734,7 +4149,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3744,12 +4159,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3757,9 +4172,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3888,16 +4303,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -4056,9 +4471,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -4066,10 +4481,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -5560,6 +5975,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+
+/*
+ * Configure a wait event.
+ *
+ * Add the wait event that this ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* Only full (re)initialization is supported for now. */
+ Assert(reinit);
+
+ if (fsstate->s.commonstate->leader == node)
+ {
+ AddWaitEventToSet(wes,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, caller_data);
+ return true;
+ }
+
+ return false;
+}
+
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation, for use while EXPLAINing ForeignScan. It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..359208a12a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1780,25 +1780,25 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1840,12 +1840,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1904,8 +1904,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
-- Test that UPDATE/DELETE with inherited target works with row-level triggers
CREATE TRIGGER trig_row_before
--
2.18.2
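[Editorial note] To make the queue manipulation in the patch above easier to follow: per connection there is a singly linked waiter list whose head is the current leader, and the leader caches a pointer to the list's tail. The following is a simplified, self-contained sketch of that structure; the type and function names here are illustrative stand-ins, not the patch's actual definitions.

```c
#include <stddef.h>

/* Illustrative stand-ins for PgFdwScanState's queue fields. */
typedef struct Node
{
    struct Node *waiter;      /* next node waiting on this connection */
    struct Node *last_waiter; /* tail pointer, valid only on the leader */
    int          queued;
} Node;

typedef struct ConnState
{
    Node *leader;             /* node currently owning the connection */
} ConnState;

/* Append a node to the tail of the leader's waiter list. */
static void
add_waiter(ConnState *cs, Node *n)
{
    Node *leader = cs->leader;

    n->waiter = NULL;
    leader->last_waiter->waiter = n;
    leader->last_waiter = n;
    n->queued = 1;
}

/*
 * Remove a non-leader node from the waiter list.  A linear scan, as in
 * remove_async_node: this only runs on the shutdown path, so no faster
 * data structure is needed.
 */
static void
remove_waiter(ConnState *cs, Node *n)
{
    Node *leader = cs->leader;
    Node *prev = leader;
    Node *cur = leader->waiter;

    while (cur)
    {
        if (cur == n)
        {
            prev->waiter = cur->waiter;
            if (leader->last_waiter == cur)
                leader->last_waiter = prev; /* relink tail to predecessor */
            n->queued = 0;
            return;
        }
        prev = cur;
        cur = cur->waiter;
    }
}
```

The tail-relink step mirrors the "relink to the previous node if the last node was removed" branch in the patch.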
On 6/10/20 8:05 AM, Kyotaro Horiguchi wrote:
Hello, Andrey.
At Tue, 9 Jun 2020 14:20:42 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in
On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
2. Total cost of an Append node is a sum of the subplans. Maybe in the
case of asynchronous append we need to use some reduce factor?
Yes. For the reason mentioned above, foreign subpaths don't affect
the startup cost of Append as far as any sync subpaths exist. If no
sync subpaths exist, the Append's startup cost is the minimum startup
cost among the async subpaths.
I mean that you could possibly change the computation of the total cost
of the async Append node. It may affect the planner's choice between a
ForeignScan (followed by executing the JOIN locally) and partitionwise
join strategies.
Have you also considered the possibility of a dynamic choice between
synchronous and asynchronous Append (during optimization)? This may be
useful for a query with a LIMIT clause.
--
Andrey Lepikhov
Postgres Professional
The patch has a problem with partitionwise aggregates.
Asynchronous Append does not allow the planner to use partial aggregates.
You can see an example in the attachment. I can't understand why: the
costs of the partitionwise join are lower.
The initial script and EXPLAIN outputs of the query with and without the
patch are in the attachment.
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
Thanks for testing, but..
At Mon, 15 Jun 2020 08:51:23 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in
The patch has a problem with partitionwise aggregates.
Asynchronous append do not allow the planner to use partial
aggregates. Example you can see in attachment. I can't understand why:
costs of partitionwise join are less.
Initial script and explains of the query with and without the patch
you can see in attachment.
I got more or less the same plan as the second one without the patch
(that is, on vanilla master/HEAD, but with merge joins instead).
I'm not sure what prevented join pushdown, but the difference between
the two is whether each partitionwise join is pushed down to the
remote or not. That hardly seems related to the async execution
patch.
Could you tell me how you got the first plan?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 6/15/20 1:29 PM, Kyotaro Horiguchi wrote:
Thanks for testing, but..
At Mon, 15 Jun 2020 08:51:23 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in
The patch has a problem with partitionwise aggregates.
Asynchronous append do not allow the planner to use partial
aggregates. Example you can see in attachment. I can't understand why:
costs of partitionwise join are less.
Initial script and explains of the query with and without the patch
you can see in attachment.
I had more or less the same plan with the second one without the patch
(that is, vanilla master/HEAD, but used merge joins instead).
I'm not sure what prevented join pushdown, but the difference between
the two is whether the each partitionwise join is pushed down to
remote or not, That is hardly seems related to the async execution
patch.
Could you tell me how did you get the first plan?
1. Use a clean, current vanilla master.
2. Start two instances with the script 'frgn2n.sh' from the attachment.
There I set these GUCs:
enable_partitionwise_join = true
enable_partitionwise_aggregate = true
3. Execute query:
explain analyze SELECT sum(parts.b)
FROM parts, second
WHERE parts.a = second.a AND second.b < 100;
That's all.
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
Attachments:
Thanks.
My conclusion on this is that the async patch is not the cause of the
behavior change mentioned here.
At Mon, 15 Jun 2020 14:59:18 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in
Could you tell me how did you get the first plan?
1. Use clear current vanilla master.
2. Start two instances with the script 'frgn2n.sh' from attachment.
There are I set GUCs:
enable_partitionwise_join = true
enable_partitionwise_aggregate = true
3. Execute query:
explain analyze SELECT sum(parts.b)
FROM parts, second
WHERE parts.a = second.a AND second.b < 100;
That's all.
With master/HEAD, I first got the second (local join) plan for a while,
then got the first (remote join). The cause of the plan change was
found to be autovacuum on the remote node.
Before the vacuum the result of remote estimation was as follows.
Node2 (remote)
=# EXPLAIN SELECT r4.b FROM (public.part_1 r4 INNER JOIN public.second_1 r8 ON (((r4.a = r8.a)) AND ((r8.b < 100))));
QUERY PLAN
---------------------------------------------------------------------------
Merge Join (cost=2269.20..3689.70 rows=94449 width=4)
Merge Cond: (r8.a = r4.a)
-> Sort (cost=74.23..76.11 rows=753 width=4)
Sort Key: r8.a
-> Seq Scan on second_1 r8 (cost=0.00..38.25 rows=753 width=4)
Filter: (b < 100)
-> Sort (cost=2194.97..2257.68 rows=25086 width=8)
Sort Key: r4.a
-> Seq Scan on part_1 r4 (cost=0.00..361.86 rows=25086 width=8)
(9 rows)
After running a vacuum it changes as follows.
QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=5.90..776.31 rows=9741 width=4)
Hash Cond: (r4.a = r8.a)
-> Seq Scan on part_1 r4 (cost=0.00..360.78 rows=24978 width=8)
-> Hash (cost=4.93..4.93 rows=78 width=4)
-> Seq Scan on second_1 r8 (cost=0.00..4.93 rows=78 width=4)
Filter: (b < 100)
(6 rows)
That changes the plan on the local side the way you saw. I saw
exactly the same behavior with the async execution patch.
regards.
FYI, the explain results for another plan changed as follows. It is
estimated to return 25839 rows, which is far less than 94449, so the
local join beat the remote join.
=# EXPLAIN SELECT a, b FROM public.part_1 ORDER BY a ASC NULLS LAST;
QUERY PLAN
------------------------------------------------------------------
Sort (cost=2194.97..2257.68 rows=25086 width=8)
Sort Key: a
-> Seq Scan on part_1 (cost=0.00..361.86 rows=25086 width=8)
(3 rows)
=# EXPLAIN SELECT a FROM public.second_1 WHERE ((b < 100)) ORDER BY a ASC NULLS LAST;
QUERY PLAN
-----------------------------------------------------------------
Sort (cost=74.23..76.11 rows=753 width=4)
Sort Key: a
-> Seq Scan on second_1 (cost=0.00..38.25 rows=753 width=4)
Filter: (b < 100)
(4 rows)
Are changed to:
=# EXPLAIN SELECT a, b FROM public.part_1 ORDER BY a ASC NULLS LAST;
QUERY PLAN
------------------------------------------------------------------
Sort (cost=2185.22..2247.66 rows=24978 width=8)
Sort Key: a
-> Seq Scan on part_1 (cost=0.00..360.78 rows=24978 width=8)
(3 rows)
horiguti=# EXPLAIN SELECT a FROM public.second_1 WHERE ((b < 100)) ORDER BY a ASC NULLS LAST;
QUERY PLAN
---------------------------------------------------------------
Sort (cost=7.38..7.57 rows=78 width=4)
Sort Key: a
-> Seq Scan on second_1 (cost=0.00..4.93 rows=78 width=4)
Filter: (b < 100)
(4 rows)
They return 25056 rows, which is far more than 9741 rows. So remote
join won.
Of course the number of returning rows is not the only factor of the
cost change but is the most significant factor in this case.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 6/16/20 1:30 PM, Kyotaro Horiguchi wrote:
They return 25056 rows, which is far more than 9741 rows. So remote
join won.
Of course the number of returning rows is not the only factor of the
cost change but is the most significant factor in this case.
Thanks for the attention.
I see one slight flaw in this approach to asynchronous Append:
AsyncAppend works only for ForeignScan subplans. If we have a
PartialAggregate, Join, or another more complicated subplan, we can't
use the asynchronous machinery.
It may lead to a situation where a small difference in a filter constant
can cause a big difference in execution time.
I imagine an Append node that can switch its current subplan from time to
time, with all ForeignScan nodes of the overall plan added to one
queue. The scan buffer can be larger than a cursor fetch size, and each
IterateForeignScan() call can trigger an asynchronous scan of another
ForeignScan node if the buffer is not full.
But these are only thoughts, not a proposal. I have no questions about
your patch right now.
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
At Wed, 17 Jun 2020 15:01:08 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in
On 6/16/20 1:30 PM, Kyotaro Horiguchi wrote:
They return 25056 rows, which is far more than 9741 rows. So remote
join won.
Of course the number of returning rows is not the only factor of the
cost change but is the most significant factor in this case.
Thanks for the attention.
I see one slight flaw of this approach to asynchronous append:
AsyncAppend works only for ForeignScan subplans. if we have
PartialAggregate, Join or another more complicated subplan, we can't
use asynchronous machinery.
Yes, the asynchronous append works only when it has at least one
async-capable immediate subnode. Currently there's only one
async-capable node, ForeignScan.
I imagine an Append node, that can switch current subplan from time to
time and all ForeignScan nodes of the overall plan are added to one
queue. The scan buffer can be larger than a cursor fetch size and each
IterateForeignScan() call can induce asynchronous scan of another
ForeignScan node if buffer is not full.
But these are only thoughts, not an proposal. I have no questions to
your patch right now.
A major property of async-capable nodes is "yieldability": a node
ought to be able to give way to other nodes when it is not ready
to return a tuple. That means such nodes are state machines rather than
functions. Fortunately, ForeignScan is natively a kind of state machine
in a sense, so it is easily turned into an async-capable node. Append is
also a state machine in the same sense, but currently no other node
can use it as an async-capable node.
For example, an Agg or Sort node generally needs two or more tuples
from its subnode to generate one tuple to return to its parent, and
some working memory is needed while generating the returning tuple. If
such a node takes in a tuple from a subnode but has not yet generated
a result tuple, it must yield CPU time to other nodes. These nodes are
not state machines at all, and it is somewhat hard to make them so. It
gets quite complex in WindowAgg, since it calls subnodes at arbitrary
call levels of its component functions.
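As a toy illustration of the yieldable state-machine shape described above (plain C, invented names, not actual executor code): instead of blocking until a tuple arrives, the node records its state and returns control, so the caller can wait on a WaitEventSet for whichever node fires first.

```c
#include <assert.h>

/* Possible node states, loosely modeled on the patch's asyncstate idea. */
typedef enum { AS_WAITING, AS_AVAILABLE, AS_DONE } AsyncState;

typedef struct AsyncNode
{
	AsyncState	state;
	int			pending;	/* tuples still in flight from the "remote" */
	int			arrived;	/* tuples arrived but not yet returned */
} AsyncNode;

/* Simulate the node's socket becoming readable: one tuple arrives. */
static void
socket_fired(AsyncNode *n)
{
	if (n->pending > 0)
	{
		n->pending--;
		n->arrived++;
	}
}

/* Non-blocking Exec: either hands back a tuple, or yields. */
static AsyncState
exec_async_node(AsyncNode *n)
{
	if (n->arrived > 0)
	{
		n->arrived--;
		n->state = AS_AVAILABLE;	/* caller consumes one tuple */
	}
	else if (n->pending > 0)
		n->state = AS_WAITING;		/* yield; caller waits on events */
	else
		n->state = AS_DONE;
	return n->state;
}
```

An Agg or Sort would additionally have to checkpoint its partially built aggregation state at every such yield point, which is exactly the part that is hard to retrofit.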
A further issue is that leaf scan nodes (SeqScan, IndexScan, etc.)
also need to be asynchronous.
Finally, the executor would turn into push style from the current
volcano (pull) style.
I tried all of that (perhaps except scan nodes) a couple of years ago
but the result was a kind of crap^^;
After all, I returned to the current shape. It doesn't seem bad as
Thomas proposed the same thing.
*1: async-aware is defined (here) as a node that can have
async-capable subnodes.
It may lead to a situation where a small difference in a filter constant
can cause a big difference in execution time.
Isn't that what we usually see? We could get a big win under certain
conditions, without a loss otherwise.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
As a result of an off-list discussion with Fujita-san, I'm going to
hold off development until he decides whether my patch or Thomas' is
better.
However, I fixed two misbehaviors and rebased.
A. It ran ordered Append asynchronously, which leads to a bogus
result. I taught create_append_plan not to make subnodes async when
the pathkeys are not NIL.
B. It calculated the total cost of Append by summing up the total
costs of all subnodes, including async subnodes. That is too
pessimistic, so I changed it to the following:
Max(total cost of sync subnodes, maximum cost of async subnodes)
Although this is a bit too optimistic in that it ignores interference
between async subnodes, it is more realistic in cases where the
subnode ForeignScans connect to different servers.
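For illustration, the revised formula can be sketched in plain C (this is a simplification for exposition, not the actual costsize.c code):

```c
#include <assert.h>

/*
 * Revised Append costing: sync subnodes run one after another, so their
 * costs add up; async subnodes overlap, so only the slowest one counts.
 * Total = max(sum of sync subnode costs, max of async subnode costs).
 */
static double
append_total_cost(const double *sync_costs, int nsync,
				  const double *async_costs, int nasync)
{
	double		sync_total = 0.0;
	double		async_max = 0.0;
	int			i;

	for (i = 0; i < nsync; i++)
		sync_total += sync_costs[i];
	for (i = 0; i < nasync; i++)
		if (async_costs[i] > async_max)
			async_max = async_costs[i];
	return (sync_total > async_max) ? sync_total : async_max;
}
```

Under this model the async subnodes' cost disappears entirely whenever the sync subnodes already dominate, which is where the "too optimistic" caveat above comes from.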
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v5-0001-Allow-wait-event-set-to-be-registered-to-resource.patch (text/x-patch)
From 76349549522b1c8ac9bad637cce763718331a066 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v5 1/3] Allow wait event set to be registered to resource
owner
WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify a resource owner.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 18 ++++-
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 67 +++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 +++
6 files changed, 96 insertions(+), 5 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7717bb2719..16aefb03ee 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 91fa4b619b..10d71b46cb 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -56,6 +56,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -84,6 +85,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -393,7 +396,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -560,12 +563,15 @@ ResetLatch(Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -680,6 +686,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -725,6 +736,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 37b6a4eecd..fcc92138fe 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -70,7 +70,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
{
WaitEventSet *new_event_set;
- new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+ new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
MyLatch, NULL);
AddWaitEventToSet(new_event_set, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..237ca9fa30 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
ResourceArray jitarr; /* JIT contexts */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
jit_release_context(context);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -725,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
Assert(owner->jitarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -752,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
ResourceArrayFree(&(owner->jitarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1370,3 +1385,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
elog(ERROR, "JIT context %p is not owned by resource owner %s",
DatumGetPointer(handle), owner->name);
}
+
+/*
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 46ae56cae3..b1b8375768 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
extern void SetLatch(Latch *latch);
extern void ResetLatch(Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
extern void ResourceOwnerForgetJIT(ResourceOwner owner,
Datum handle);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.18.4
v5-0002-Infrastructure-for-asynchronous-execution.patch (text/x-patch)
From e45b0a7c2a832a2e02411528e95efb4441d7d22d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v5 2/3] Infrastructure for asynchronous execution
This patch adds infrastructure for asynchronous execution. As a PoC,
it makes only Append capable of handling asynchronously executable
subnodes.
---
src/backend/commands/explain.c | 17 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execAsync.c | 152 +++++++++++
src/backend/executor/nodeAppend.c | 342 ++++++++++++++++++++----
src/backend/executor/nodeForeignscan.c | 21 ++
src/backend/nodes/bitmapset.c | 72 +++++
src/backend/nodes/copyfuncs.c | 3 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 3 +
src/backend/optimizer/path/allpaths.c | 24 ++
src/backend/optimizer/path/costsize.c | 55 +++-
src/backend/optimizer/plan/createplan.c | 45 +++-
src/backend/postmaster/pgstat.c | 3 +
src/backend/postmaster/syslogger.c | 2 +-
src/backend/utils/adt/ruleutils.c | 8 +-
src/backend/utils/resowner/resowner.c | 4 +-
src/include/executor/execAsync.h | 22 ++
src/include/executor/executor.h | 1 +
src/include/executor/nodeForeignscan.h | 3 +
src/include/foreign/fdwapi.h | 11 +
src/include/nodes/bitmapset.h | 1 +
src/include/nodes/execnodes.h | 23 +-
src/include/nodes/plannodes.h | 9 +
src/include/optimizer/paths.h | 2 +
src/include/pgstat.h | 3 +-
25 files changed, 757 insertions(+), 73 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 093864cfc0..244676ba11 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -86,6 +86,7 @@ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
ExplainState *es);
static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1389,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1970,6 +1973,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Hash:
show_hash_info(castNode(HashState, planstate), es);
break;
+
+ case T_Append:
+ show_append_info(castNode(AppendState, planstate), es);
+ break;
+
default:
break;
}
@@ -2323,6 +2331,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ancestors, es);
}
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+ Append *plan = (Append *) astate->ps.plan;
+
+ if (plan->nasyncplans > 0)
+ ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
/*
* Show the grouping keys for an Agg node.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller didn't reuse existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit)
+{
+ switch (nodeTag(node))
+ {
+ case T_ForeignScanState:
+ return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+ wes, data, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(node));
+ }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+ int **p_refind;
+ int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+ /* arg is the address of the variable refind in ExecAsyncEventWait */
+ ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+ *mcbarg->p_refind = NULL;
+ *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
+ WaitEventSet *wes;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred = 0;
+ Bitmapset *fired_events = NULL;
+ int i;
+ int n;
+
+ n = bms_num_members(waitnodes);
+ wes = CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner, n);
+ if (refindsize < n)
+ {
+ if (refindsize == 0)
+ refindsize = EVENT_BUFFER_SIZE; /* XXX */
+ while (refindsize < n)
+ refindsize *= 2;
+ if (refind)
+ refind = (int *) repalloc(refind, refindsize * sizeof(int));
+ else
+ {
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+ MemoryContext oldctxt =
+ MemoryContextSwitchTo(TopTransactionContext);
+
+ /*
+ * refind points to a memory block in
+ * TopTransactionContext. Register a callback to reset it.
+ */
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+ refind = (int *) palloc(refindsize * sizeof(int));
+ MemoryContextSwitchTo(oldctxt);
+ }
+ }
+
+ /* Prepare WaitEventSet for waiting on the waitnodes. */
+ n = 0;
+ for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+ i = bms_next_member(waitnodes, i))
+ {
+ refind[i] = i;
+ if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+ n++;
+ }
+
+ /* Return immediately if no node to wait. */
+ if (n == 0)
+ {
+ FreeWaitEventSet(wes);
+ return NULL;
+ }
+
+ noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+ EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+ FreeWaitEventSet(wes);
+ if (noccurred == 0)
+ return NULL;
+
+ for (i = 0 ; i < noccurred ; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+ {
+ int n = *(int*)w->user_data;
+
+ fired_events = bms_add_member(fired_events, n);
+ }
+ }
+
+ return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..60c36ee048 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
#include "miscadmin.h"
/* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
#define INVALID_SUBPLAN_INDEX -1
static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
PlanState **appendplanstates;
Bitmapset *validsubplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
/* check for unsupported flags */
- Assert(!(eflags & EXEC_FLAG_MARK));
+ Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
/*
* create new AppendState for our append node
*/
appendstate->ps.plan = (Plan *) node;
appendstate->ps.state = estate;
- appendstate->ps.ExecProcNode = ExecAppend;
/* Let choose_next_subplan_* function handle setting the first subplan */
- appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/*
* When no run-time pruning is required and there's at least one
- * subplan, we can fill as_valid_subplans immediately, preventing
+ * subplan, we can fill as_valid_syncsubplans immediately, preventing
* later calls to ExecFindMatchingSubPlans.
*/
if (!prunestate->do_exec_prune && nplans > 0)
- appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
}
else
{
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* subplans as valid; they must also all be initialized.
*/
Assert(nplans > 0);
- appendstate->as_valid_subplans = validsubplans =
- bms_add_range(NULL, 0, nplans - 1);
+ validsubplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
appendstate->as_prune_state = NULL;
}
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
*/
j = 0;
firstvalid = nplans;
+ nasyncplans = 0;
+
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ int sub_eflags = eflags;
+
+ /* Let async-capable subplans run asynchronously */
+ if (i < node->nasyncplans)
+ {
+ sub_eflags |= EXEC_FLAG_ASYNC;
+ nasyncplans++;
+ }
/*
* Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
if (i >= node->first_partial_plan && j < firstvalid)
firstvalid = j;
- appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+ appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
}
appendstate->as_first_partial_plan = firstvalid;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* fill in async stuff */
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_syncdone = (nasyncplans == nplans);
+ appendstate->as_exec_prune = false;
+
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+ if (appendstate->as_nasyncplans)
+ {
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async requests need a request */
+ appendstate->as_needrequest =
+ bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+ /*
+ * ExecAppendAsync needs as_valid_syncsubplans to handle async
+ * subnodes.
+ */
+ if (appendstate->as_prune_state != NULL &&
+ appendstate->as_prune_state->do_exec_prune)
+ {
+ Assert(appendstate->as_valid_syncsubplans == NULL);
+
+ appendstate->as_exec_prune = true;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
- if (node->as_whichplan < 0)
+ if (node->as_whichsyncplan < 0)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
* If no subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+ if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
!node->choose_next_subplan(node))
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
+ Assert(node->as_nasyncplans == 0);
+
for (;;)
{
PlanState *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
/*
* figure out which subplan we are currently processing
*/
- Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
- subnode = node->appendplans[node->as_whichplan];
+ Assert(node->as_whichsyncplan >= 0 &&
+ node->as_whichsyncplan < node->as_nplans);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
}
}
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+ AppendState *node = castNode(AppendState, pstate);
+ Bitmapset *needrequest;
+ int i;
+
+ Assert(node->as_nasyncplans > 0);
+
+restart:
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (node->as_exec_prune)
+ {
+ Bitmapset *valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ /* Distribute valid subplans into sync and async */
+ node->as_needrequest =
+ bms_intersect(node->as_needrequest, valid_subplans);
+ node->as_valid_syncsubplans =
+ bms_difference(valid_subplans, node->as_needrequest);
+
+ node->as_exec_prune = false;
+ }
+
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ while ((i = bms_first_member(needrequest)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
+ {
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ }
+ }
+ else
+ node->as_pending_async = bms_add_member(node->as_pending_async, i);
+ }
+ bms_free(needrequest);
+
+ for (;;)
+ {
+ TupleTableSlot *result;
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ while (!bms_is_empty(node->as_pending_async))
+ {
+ /* Don't wait for async nodes if any sync node exists. */
+ long timeout = node->as_syncdone ? -1 : 0;
+ Bitmapset *fired;
+ int i;
+
+ fired = ExecAsyncEventWait(node->appendplans,
+ node->as_pending_async,
+ timeout);
+
+ if (bms_is_empty(fired) && node->as_syncdone)
+ {
+ /*
+ * We come here when all the subnodes had fired before
+ * waiting. Retry fetching from the nodes.
+ */
+ node->as_needrequest = node->as_pending_async;
+ node->as_pending_async = NULL;
+ goto restart;
+ }
+
+ while ((i = bms_first_member(fired)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+ slot = ExecProcNode(subnode);
+
+ Assert(subnode->asyncstate == AS_AVAILABLE);
+
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, i);
+ }
+
+ node->as_pending_async =
+ bms_del_member(node->as_pending_async, i);
+ }
+ bms_free(fired);
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done scanning
+ * this node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the synchronous children.
+ */
+
+ if (!node->as_syncdone &&
+ node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+ node->as_syncdone = !node->choose_next_subplan(node);
+
+ if (node->as_syncdone)
+ {
+ Assert(bms_is_empty(node->as_pending_async));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
+ /*
+ * get a tuple from the subplan
+ */
+ result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+ if (!TupIsNull(result))
+ {
+ /*
+ * If the subplan gave us something then return it as-is. We do
+ * NOT make use of the result slot that was set up in
+ * ExecInitAppend; there's no need for it.
+ */
+ return result;
+ }
+
+ /*
+ * Go on to the "next" subplan. If no more subplans, return the empty
+ * slot set up for us by ExecInitAppend, unless there are async plans
+ * we have yet to finish.
+ */
+ if (!node->choose_next_subplan(node))
+ {
+ node->as_syncdone = true;
+ if (bms_is_empty(node->as_pending_async))
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /* Else loop back and try to get a tuple from the new subplan */
+ }
+}
+
/* ----------------------------------------------------------------
* ExecEndAppend
*
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
bms_overlap(node->ps.chgParam,
node->as_prune_state->execparamids))
{
- bms_free(node->as_valid_subplans);
- node->as_valid_subplans = NULL;
+ bms_free(node->as_valid_syncsubplans);
+ node->as_valid_syncsubplans = NULL;
}
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ ExecShutdownNode(node->appendplans[i]);
+
+ node->as_nasyncresult = 0;
+ node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
}
/* Let choose_next_subplan_* function handle setting the first subplan */
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
}
/* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
static bool
choose_next_subplan_locally(AppendState *node)
{
- int whichplan = node->as_whichplan;
+ int whichplan = node->as_whichsyncplan;
int nextplan;
/* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
{
- if (node->as_valid_subplans == NULL)
- node->as_valid_subplans =
+ /* Shouldn't have an active async node */
+ Assert(bms_is_empty(node->as_needrequest));
+
+ if (node->as_valid_syncsubplans == NULL)
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
+ /* Exclude async plans */
+ if (node->as_nasyncplans > 0)
+ bms_del_range(node->as_valid_syncsubplans,
+ 0, node->as_nasyncplans - 1);
+
whichplan = -1;
}
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
Assert(whichplan >= -1 && whichplan <= node->as_nplans);
if (ScanDirectionIsForward(node->ps.state->es_direction))
- nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
else
- nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
if (nextplan < 0)
return false;
- node->as_whichplan = nextplan;
+ node->as_whichsyncplan = nextplan;
return true;
}
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
{
/* Mark just-completed subplan as finished. */
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
}
else
{
/* Start with last subplan. */
- node->as_whichplan = node->as_nplans - 1;
+ node->as_whichsyncplan = node->as_nplans - 1;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be
* set to all subplans.
*/
- if (node->as_valid_subplans == NULL)
+ if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
/*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
}
/* Loop until we find a subplan to execute. */
- while (pstate->pa_finished[node->as_whichplan])
+ while (pstate->pa_finished[node->as_whichsyncplan])
{
- if (node->as_whichplan == 0)
+ if (node->as_whichsyncplan == 0)
{
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
LWLockRelease(&pstate->pa_lock);
return false;
}
/*
- * We needn't pay attention to as_valid_subplans here as all invalid
+ * We needn't pay attention to as_valid_syncsubplans here as all invalid
* plans have been marked as finished.
*/
- node->as_whichplan--;
+ node->as_whichsyncplan--;
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
/* Mark just-completed subplan as finished. */
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be set
* to all subplans.
*/
- else if (node->as_valid_subplans == NULL)
+ else if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
mark_invalid_subplans_as_finished(node);
}
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Save the plan from which we are starting the search. */
- node->as_whichplan = pstate->pa_next_plan;
+ node->as_whichsyncplan = pstate->pa_next_plan;
/* Loop until we find a valid subplan to execute. */
while (pstate->pa_finished[pstate->pa_next_plan])
{
int nextplan;
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
if (nextplan >= 0)
{
/* Advance to the next valid plan. */
pstate->pa_next_plan = nextplan;
}
- else if (node->as_whichplan > node->as_first_partial_plan)
+ else if (node->as_whichsyncplan > node->as_first_partial_plan)
{
/*
* Try looping back to the first valid partial plan, if there is
* one. If there isn't, arrange to bail out below.
*/
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
pstate->pa_next_plan =
- nextplan < 0 ? node->as_whichplan : nextplan;
+ nextplan < 0 ? node->as_whichsyncplan : nextplan;
}
else
{
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
* At last plan, and either there are no partial plans or we've
* tried them all. Arrange to bail out.
*/
- pstate->pa_next_plan = node->as_whichplan;
+ pstate->pa_next_plan = node->as_whichsyncplan;
}
- if (pstate->pa_next_plan == node->as_whichplan)
+ if (pstate->pa_next_plan == node->as_whichsyncplan)
{
/* We've tried everything! */
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Pick the plan we found, and advance pa_next_plan one more time. */
- node->as_whichplan = pstate->pa_next_plan;
- pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+ node->as_whichsyncplan = pstate->pa_next_plan;
+ pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
/*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
*/
if (pstate->pa_next_plan < 0)
{
- int nextplan = bms_next_member(node->as_valid_subplans,
+ int nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
Assert(node->as_prune_state);
/* Nothing to do if all plans are valid */
- if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+ if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
return;
/* Mark all non-valid plans as finished */
for (i = 0; i < node->as_nplans; i++)
{
- if (!bms_is_member(i, node->as_valid_subplans))
+ if (!bms_is_member(i, node->as_valid_syncsubplans))
node->as_pstate->pa_finished[i] = true;
}
}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+ scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+ if ((eflags & EXEC_FLAG_ASYNC) != 0)
+ scanstate->fs_async = true;
/*
* Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecForeignAsyncConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+ caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
return a;
}
+/*
+ * bms_del_range
+ * Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+ int lwordnum,
+ lbitnum,
+ uwordnum,
+ ushiftbits,
+ wordnum;
+
+ if (lower < 0 || upper < 0)
+ elog(ERROR, "negative bitmapset member not allowed");
+ if (lower > upper)
+ elog(ERROR, "lower range must not be above upper range");
+ uwordnum = WORDNUM(upper);
+
+ if (a == NULL)
+ {
+ a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ }
+
+ /* ensure we have enough words to store the upper bit */
+ else if (uwordnum >= a->nwords)
+ {
+ int oldnwords = a->nwords;
+ int i;
+
+ a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ /* zero out the enlarged portion */
+ for (i = oldnwords; i < a->nwords; i++)
+ a->words[i] = 0;
+ }
+
+ wordnum = lwordnum = WORDNUM(lower);
+
+ lbitnum = BITNUM(lower);
+ ushiftbits = BITNUM(upper) + 1;
+
+ /*
+ * In the special case where lwordnum is the same as uwordnum, we must
+ * perform both the upper and lower masking on the same word.
+ */
+ if (lwordnum == uwordnum)
+ {
+ a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+ | (~(bitmapword) 0) << ushiftbits);
+ }
+ else
+ {
+ /* turn off lbitnum and all bits left of it */
+ a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+ /* turn off all bits for any intermediate words */
+ while (wordnum < uwordnum)
+ a->words[wordnum++] = (bitmapword) 0;
+
+ /* turn off upper's bit and all bits right of it. */
+ a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+ }
+
+ return a;
+}
+
/*
* bms_int_members - like bms_intersect, but left input is recycled
*/
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8cf87e6d0..89a49e2fdc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..d4bb44b268 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..63af7c02d8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1572,6 +1572,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1672,6 +1673,8 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index d984da25d7..bb4c8723bc 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3937,6 +3937,30 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
list_free(live_children);
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
+
/*****************************************************************************
* DEBUG SUPPORT
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4ff3c7a2fd..ccaeb8cc5c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2049,22 +2049,59 @@ cost_append(AppendPath *apath)
if (pathkeys == NIL)
{
- Path *subpath = (Path *) linitial(apath->subpaths);
-
- /*
- * For an unordered, non-parallel-aware Append we take the startup
- * cost as the startup cost of the first subpath.
- */
- apath->path.startup_cost = subpath->startup_cost;
+ Cost first_nonasync_startup_cost = -1.0;
+ Cost async_min_startup_cost = -1;
+ Cost async_max_cost = 0.0;
/* Compute rows and costs as sums of subplan rows and costs. */
foreach(l, apath->subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ /*
+ * For an unordered, non-parallel-aware Append we take the
+ * startup cost as the startup cost of the first
+ * non-async-capable subpath, or the minimum startup cost of the
+ * async-capable subpaths.
+ */
+ if (!is_async_capable_path(subpath))
+ {
+ if (first_nonasync_startup_cost < 0.0)
+ first_nonasync_startup_cost = subpath->startup_cost;
+
+ apath->path.total_cost += subpath->total_cost;
+ }
+ else
+ {
+ if (async_min_startup_cost < 0.0 ||
+ async_min_startup_cost > subpath->startup_cost)
+ async_min_startup_cost = subpath->startup_cost;
+
+ /*
+ * It's not obvious how to determine the total cost of
+ * async subnodes. We assume it is the maximum total cost
+ * among all async subnodes, although that is not always true.
+ */
+ if (async_max_cost < subpath->total_cost)
+ async_max_cost = subpath->total_cost;
+ }
+
apath->path.rows += subpath->rows;
- apath->path.total_cost += subpath->total_cost;
}
+
+ /*
+ * If there are any sync subnodes, the startup cost is the
+ * startup cost of the first sync subnode. Otherwise it's the
+ * minimum startup cost of the async subnodes.
+ */
+ if (first_nonasync_startup_cost >= 0.0)
+ apath->path.startup_cost = first_nonasync_startup_cost;
+ else
+ apath->path.startup_cost = async_min_startup_cost;
+
+ /* Use async maximum cost if it exceeds the sync total cost */
+ if (async_max_cost > apath->path.total_cost)
+ apath->path.total_cost = async_max_cost;
}
else
{
@@ -2085,6 +2122,8 @@ cost_append(AppendPath *apath)
* This case is also different from the above in that we have to
* account for possibly injecting sorts into subpaths that aren't
* natively ordered.
+ *
+ * Note: An ordered append won't be run asynchronously.
*/
foreach(l, apath->subpaths)
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index eb9543f6ad..27ff01f159 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1082,6 +1082,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
bool tlist_was_changed = false;
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
+ List *asyncpaths = NIL;
+ List *syncpaths = NIL;
+ List *newsubpaths = NIL;
ListCell *subpaths;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
@@ -1090,6 +1095,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1219,9 +1227,40 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
}
- subplans = lappend(subplans, subplan);
+ /*
+ * Classify as async-capable or not. If we have decided to run the
+ * children in parallel, we cannot run any of them asynchronously.
+ * Likewise, the planner assumes that all subnodes are executed in
+ * order if this Append is ordered, so no subpath can be run
+ * asynchronously in that case.
+ */
+ if (pathkeys == NIL &&
+ !best_path->path.parallel_safe && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ asyncplans = lappend(asyncplans, subplan);
+ asyncpaths = lappend(asyncpaths, subpath);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ {
+ syncplans = lappend(syncplans, subplan);
+ syncpaths = lappend(syncpaths, subpath);
+ }
+
+ first = false;
}
+ /*
+ * subplans contains the async plans first, if any, followed by the sync
+ * plans, if any. The subpaths must be ordered the same way to keep the
+ * partition pruning information in sync with the subplans.
+ */
+ subplans = list_concat(asyncplans, syncplans);
+ newsubpaths = list_concat(asyncpaths, syncpaths);
+
/*
* If any quals exist, they may be useful to perform further partition
* pruning during execution. Gather information needed by the executor to
@@ -1249,7 +1288,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
if (prunequal != NIL)
partpruneinfo =
make_partition_pruneinfo(root, rel,
- best_path->subpaths,
+ newsubpaths,
best_path->partitioned_rels,
prunequal);
}
@@ -1257,6 +1296,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
plan->appendplans = subplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
+ plan->nasyncplans = nasyncplans;
+ plan->referent = referent_is_sync ? nasyncplans : 0;
copy_generic_path_info(&plan->plan, (Path *) best_path);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c022597bc0..4db86252c9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3878,6 +3878,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index ffcb54968f..a4de6d90e2 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -300,7 +300,7 @@ SysLoggerMain(int argc, char *argv[])
* syslog pipe, which implies that all other backends have exited
* (including the postmaster).
*/
- wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+ wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
#ifndef WIN32
AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 2cbcb4b85e..46a4b0696f 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4574,10 +4574,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
* tlists according to one of the children, and the first one is the most
* natural choice. Likewise special-case ModifyTable to pretend that the
* first child plan is the OUTER referent; this is to support RETURNING
- * lists containing references to non-target relations.
+ * lists containing references to non-target relations. For Append, use the
+ * explicitly specified referent.
*/
if (IsA(plan, Append))
- dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+ {
+ Append *app = (Append *) plan;
+ dpns->outer_plan = list_nth(app->appendplans, app->referent);
+ }
else if (IsA(plan, MergeAppend))
dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
else if (IsA(plan, ModifyTable))
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 237ca9fa30..27742a1641 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1416,7 +1416,7 @@ void
ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
{
/*
- * XXXX: There's no property to show as an identier of a wait event set,
+ * XXXX: There's no property to show as an identifier of a wait event set,
* use its pointer instead.
*/
if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
@@ -1431,7 +1431,7 @@ static void
PrintWESLeakWarning(WaitEventSet *events)
{
/*
- * XXXX: There's no property to show as an identier of a wait event set,
+ * XXXX: There's no property to show as an identifier of a wait event set,
* use its pointer instead.
*/
elog(WARNING, "wait event set leak: %p still referenced",
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+ long timeout);
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c7deeac662..aca9e2bddd 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
#define EXEC_FLAG_MARK 0x0008 /* need mark/restore */
#define EXEC_FLAG_SKIP_TRIGGERS 0x0010 /* skip AfterTrigger calls */
#define EXEC_FLAG_WITH_NO_DATA 0x0020 /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC 0x0040 /* request async execution */
/* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data,
+ bool reinit);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
GetForeignPlan_function GetForeignPlan;
BeginForeignScan_function BeginForeignScan;
IterateForeignScan_function IterateForeignScan;
+ IterateForeignScan_function IterateForeignScanAsync;
ReScanForeignScan_function ReScanForeignScan;
EndForeignScan_function EndForeignScan;
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
InitializeDSMForeignScan_function InitializeDSMForeignScan;
ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
ShutdownForeignScan_function ShutdownForeignScan;
/* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
/* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f5dfa32d55..8e230ee5c3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -938,6 +938,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
* abstract superclass for all PlanState-type nodes.
* ----------------
*/
+typedef enum AsyncState
+{
+ AS_AVAILABLE,
+ AS_WAITING
+} AsyncState;
+
typedef struct PlanState
{
NodeTag type;
@@ -1026,6 +1032,11 @@ typedef struct PlanState
bool outeropsset;
bool inneropsset;
bool resultopsset;
+
+ /* Async subnode execution stuff */
+ AsyncState asyncstate;
+
+ int32 padding; /* to keep alignment of derived types */
} PlanState;
/* ----------------
@@ -1221,14 +1232,21 @@ struct AppendState
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
- int as_whichplan;
+ int as_whichsyncplan; /* which sync plan is being executed */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
+ int as_nasyncplans; /* # of async-capable children */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
- Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_syncsubplans;
bool (*choose_next_subplan) (AppendState *);
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ Bitmapset *as_pending_async; /* pending async plans */
+ TupleTableSlot **as_asyncresult; /* results of each async plan */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ bool as_exec_prune; /* runtime pruning needed for async exec? */
};
/* ----------------
@@ -1796,6 +1814,7 @@ typedef struct ForeignScanState
Size pscan_len; /* size of parallel coordination information */
/* use struct pointer to avoid including fdwapi.h here */
struct FdwRoutine *fdwroutine;
+ bool fs_async;
void *fdw_state; /* foreign-data wrapper can keep state here */
} ForeignScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..abad89b327 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous execution logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -262,6 +267,10 @@ typedef struct Append
/* Info for run-time subplan pruning; NULL if we're not doing that */
struct PartitionPruneInfo *part_prune_info;
+
+ /* Async child node execution stuff */
+ int nasyncplans; /* # async subplans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 10b6e81079..53876b2d8b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -241,4 +241,6 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+extern bool is_async_capable_path(Path *path);
+
#endif /* PATHS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..c0ea7f5aa4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,7 +887,8 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
- WAIT_EVENT_XACT_GROUP_UPDATE
+ WAIT_EVENT_XACT_GROUP_UPDATE,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.18.4
Attachment: v5-0003-async-postgres_fdw.patch (text/x-patch; charset=us-ascii)
From 37cf695f5f019bcfd6554473a50d4ff6f7b44462 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v5 3/3] async postgres_fdw
---
contrib/postgres_fdw/connection.c | 28 +
.../postgres_fdw/expected/postgres_fdw.out | 272 ++++----
contrib/postgres_fdw/postgres_fdw.c | 601 +++++++++++++++---
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +-
5 files changed, 710 insertions(+), 213 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 52d1fe3563..d9edc5e4de 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
bool invalidated; /* true if reconnect is pending */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
entry->conn, server->servername, user->umid, user->userid);
+ entry->storage = NULL;
}
/*
@@ -215,6 +217,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
return entry->conn;
}
+/*
+ * Returns the connection-specific storage for this user. Allocate it
+ * with initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ bool found;
+ ConnCacheEntry *entry;
+ ConnCacheKey key;
+
+ /* Find storage using the same key as GetConnection */
+ key = user->umid;
+ entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+ Assert(found);
+
+ /* Create one if it doesn't exist yet. */
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
/*
* Connect to remote server using specified server and user mapping properties.
*/
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 82fc1290ef..bf9b4041cd 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6973,7 +6973,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7001,7 +7001,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7029,7 +7029,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7057,7 +7057,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7127,35 +7127,41 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7165,35 +7171,41 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7223,11 +7235,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-> Hash Join
Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
@@ -7241,12 +7254,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
select tableoid::regclass, * from bar order by 1,2;
@@ -7276,16 +7290,17 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
+ Output: ROW(foo_1.f1), foo_1.f1
+ Remote SQL: SELECT f1 FROM public.loct1
+ -> Async Foreign Scan on public.foo2 foo_3
+ Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+ Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
- Output: ROW(foo_1.f1), foo_1.f1
- Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
- Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
- Remote SQL: SELECT f1 FROM public.loct1
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -7303,17 +7318,18 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
+ Output: ROW(foo_1.f1), foo_1.f1
+ Remote SQL: SELECT f1 FROM public.loct1
+ -> Async Foreign Scan on public.foo2 foo_3
+ Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+ Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
- Output: ROW(foo_1.f1), foo_1.f1
- Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
- Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
- Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+(47 rows)
update bar set f2 = f2 + 100
from
@@ -7463,27 +7479,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2 bar_1
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2 bar_1
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2 bar_1
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2 bar_1
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
@@ -8558,11 +8580,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a % 25 =0 ORDER BY 1,2,3;
a | b | c
@@ -8597,20 +8620,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
wr | wr
@@ -8639,11 +8664,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
a | b
@@ -8696,21 +8722,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
-- test FOR UPDATE; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
- QUERY PLAN
---------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------
LockRows
-> Sort
Sort Key: t1.a
-> Hash Join
Hash Cond: (t2.b = t1.a)
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
-> Hash
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
a | b
@@ -8745,18 +8773,19 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+ Async subplans: 3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
-- Plan with partitionwise aggregates is enabled
SET enable_partitionwise_aggregate TO true;
@@ -8767,13 +8796,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ Async subplans: 3
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
a | sum | min | count
@@ -8795,29 +8825,22 @@ SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
Sort
Output: t1.a, (count(((t1.*)::pagg_tab)))
Sort Key: t1.a
- -> Append
- -> HashAggregate
- Output: t1.a, count(((t1.*)::pagg_tab))
- Group Key: t1.a
- Filter: (avg(t1.b) < '22'::numeric)
- -> Foreign Scan on public.fpagg_tab_p1 t1
- Output: t1.a, t1.*, t1.b
- Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
- -> HashAggregate
- Output: t1_1.a, count(((t1_1.*)::pagg_tab))
- Group Key: t1_1.a
- Filter: (avg(t1_1.b) < '22'::numeric)
- -> Foreign Scan on public.fpagg_tab_p2 t1_1
+ -> HashAggregate
+ Output: t1.a, count(((t1.*)::pagg_tab))
+ Group Key: t1.a
+ Filter: (avg(t1.b) < '22'::numeric)
+ -> Append
+ Async subplans: 3
+ -> Async Foreign Scan on public.fpagg_tab_p1 t1_1
Output: t1_1.a, t1_1.*, t1_1.b
- Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
- -> HashAggregate
- Output: t1_2.a, count(((t1_2.*)::pagg_tab))
- Group Key: t1_2.a
- Filter: (avg(t1_2.b) < '22'::numeric)
- -> Foreign Scan on public.fpagg_tab_p3 t1_2
+ Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
+ -> Async Foreign Scan on public.fpagg_tab_p2 t1_2
Output: t1_2.a, t1_2.*, t1_2.b
+ Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
+ -> Async Foreign Scan on public.fpagg_tab_p3 t1_3
+ Output: t1_3.a, t1_3.*, t1_3.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
-(25 rows)
+(18 rows)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
a | count
@@ -8837,20 +8860,15 @@ SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700
-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.b
- -> Finalize HashAggregate
+ -> HashAggregate
Group Key: pagg_tab.b
Filter: (sum(pagg_tab.a) < 700)
-> Append
- -> Partial HashAggregate
- Group Key: pagg_tab.b
- -> Foreign Scan on fpagg_tab_p1 pagg_tab
- -> Partial HashAggregate
- Group Key: pagg_tab_1.b
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_1
- -> Partial HashAggregate
- Group Key: pagg_tab_2.b
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_2
-(15 rows)
+ Async subplans: 3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
-- ===================================================================
-- access rights and superuser
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9fc53cad68..4bfc2d39ea 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -35,6 +37,7 @@
#include "optimizer/restrictinfo.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "postgres_fdw.h"
#include "utils/builtins.h"
#include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -122,11 +128,29 @@ enum FdwDirectModifyPrivateIndex
FdwDirectModifyPrivateSetProcessed
};
+/*
+ * Connection common state - shared among all PgFdwState instances using the
+ * same connection.
+ */
+typedef struct PgFdwConnCommonState
+{
+ ForeignScanState *leader; /* leader node of this connection */
+ bool busy; /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnCommonState *commonstate; /* connection common state */
+} PgFdwState;
+
/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +161,6 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool async; /* true if run asynchronously */
+ bool queued; /* true if this node is in waiter queue */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* last element in waiter queue.
+ * valid only on the leader node */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
/*
* Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
PG_RETURN_POINTER(routine);
}
@@ -1434,12 +1476,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+ fsstate->s.commonstate->leader = NULL;
+ fsstate->s.commonstate->busy = false;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->async = false;
+ fsstate->queued = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1487,40 +1539,241 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_values);
}
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue the node if it isn't already in the queue. Send the request
+ * immediately if the underlying connection is not busy.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+ /*
+ * Do nothing if the node is already in the queue or has already reached
+ * EOF. Note: the leader node is not marked as queued.
+ */
+ if (leader == node || fsstate->queued || fsstate->eof_reached)
+ return;
+
+ if (leader == NULL)
+ {
+ /* no leader means not busy, send request immediately */
+ request_more_data(node);
+ }
+ else
+ {
+ /* the connection is busy, queue the node */
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+ PgFdwScanState *last_waiter_state
+ = GetPgFdwScanState(leader_state->last_waiter);
+
+ last_waiter_state->waiter = node;
+ leader_state->last_waiter = node;
+ fsstate->queued = true;
+ }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Make the first waiter the next leader.
+ * Returns the new leader, or NULL if there are no waiters.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *leader_state = GetPgFdwScanState(node);
+ ForeignScanState *next_leader = leader_state->waiter;
+
+ Assert(leader_state->s.commonstate->leader == node);
+
+ if (next_leader)
+ {
+ /* the first waiter becomes the next leader */
+ PgFdwScanState *next_leader_state = GetPgFdwScanState(next_leader);
+ next_leader_state->last_waiter = leader_state->last_waiter;
+ next_leader_state->queued = false;
+ }
+
+ leader_state->waiter = NULL;
+ leader_state->s.commonstate->leader = next_leader;
+
+ return next_leader;
+}
+
+/*
+ * Remove the node from the waiter queue.
+ *
+ * Remaining results are cleared if the node is a busy leader.
+ * This is intended to be used during node shutdown.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state;
+ ForeignScanState *prev;
+ PgFdwScanState *prev_state;
+ ForeignScanState *cur;
+
+ /* nothing to do if the node is neither the leader nor in the queue */
+ if (!leader || (leader != node && !fsstate->queued))
+ return;
+
+ leader_state = GetPgFdwScanState(leader);
+
+ if (leader == node)
+ {
+ if (leader_state->s.commonstate->busy)
+ {
+ /*
+ * This node is waiting for a result; absorb it first so that
+ * subsequent commands can be sent on the connection.
+ */
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+ PGconn *conn = leader_state->s.conn;
+
+ while (PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+
+ leader_state->s.commonstate->busy = false;
+ }
+
+ move_to_next_waiter(node);
+
+ return;
+ }
+
+ /*
+ * Just remove the node from the queue.
+ *
+ * Nodes don't keep a link to the previous node, but since this function
+ * is only called on the shutdown path, we don't bother finding a faster
+ * way to do this.
+ */
+ prev = leader;
+ prev_state = leader_state;
+ cur = GetPgFdwScanState(prev)->waiter;
+ while (cur)
+ {
+ PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+ if (cur == node)
+ {
+ prev_state->waiter = curstate->waiter;
+
+ /* relink to the previous node if the last node was removed */
+ if (leader_state->last_waiter == cur)
+ leader_state->last_waiter = prev;
+
+ fsstate->queued = false;
+
+ return;
+ }
+ prev = cur;
+ prev_state = curstate;
+ cur = curstate->waiter;
+ }
+}
+
/*
* postgresIterateForeignScan
- * Retrieve next row from the result set, or clear tuple slot to indicate
- * EOF.
+ * Retrieve next row from the result set.
+ *
+ * For synchronous nodes, returning an empty tuple slot means EOF.
+ *
+ * For asynchronous nodes, when an empty tuple slot is returned, the caller
+ * needs to check the async state to tell whether all tuples have been
+ * received (AS_AVAILABLE) or more data is expected (AS_WAITING).
*/
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- /*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
- * Get some more tuples, if we've run out.
- */
+ if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+ {
+ /* we've run out, get some more tuples */
+ if (!node->fs_async)
+ {
+ /*
+ * finish the running query before sending the next command for
+ * this node
+ */
+ if (!fsstate->s.commonstate->busy)
+ vacate_connection((PgFdwState *)fsstate, false);
+
+ request_more_data(node);
+
+ /* Fetch the result immediately. */
+ fetch_received_data(node);
+ }
+ else if (!fsstate->s.commonstate->busy)
+ {
+ /* If the connection is not busy, just send the request. */
+ request_more_data(node);
+ }
+ else
+ {
+ /* The connection is busy, queue the request */
+ bool available = true;
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+ /* queue the requested node */
+ add_async_waiter(node);
+
+ /*
+ * The request for the next node cannot be sent before the leader
+ * responds. Finish the current leader if possible.
+ */
+ if (PQisBusy(leader_state->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT |
+ WL_EXIT_ON_PM_DEATH,
+ PQsocket(leader_state->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (!(rc & WL_SOCKET_READABLE))
+ available = false;
+ }
+
+ /* fetch the leader's data and re-enqueue the leader for its next request */
+ if (available)
+ {
+ fetch_received_data(leader);
+ add_async_waiter(leader);
+ }
+ }
+ }
+
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
- if (fsstate->next_tuple >= fsstate->num_tuples)
- return ExecClearTuple(slot);
+ /*
+ * We haven't received a result for this node yet; return without a
+ * tuple to give way to another node.
+ */
+ if (fsstate->eof_reached)
+ node->ss.ps.asyncstate = AS_AVAILABLE;
+ else
+ node->ss.ps.asyncstate = AS_WAITING;
+
+ return ExecClearTuple(slot);
}
/*
* Return the next tuple.
*/
+ node->ss.ps.asyncstate = AS_AVAILABLE;
ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
false);
@@ -1535,7 +1788,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1543,6 +1796,8 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ vacate_connection((PgFdwState *)fsstate, true);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1571,9 +1826,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1591,7 +1846,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1599,15 +1854,31 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
+/*
+ * postgresShutdownForeignScan
+ * Remove asynchronous state and clean up remaining data on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* remove the node from the waiter queue */
+ remove_async_node(node);
+}
+
/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
@@ -2372,7 +2643,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2457,7 +2730,11 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ /* finish the running query before sending our command */
+ vacate_connection((PgFdwState *)dmstate, true);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2504,8 +2781,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2703,6 +2980,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnCommonState *commonstate;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2747,6 +3025,18 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ commonstate = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnCommonState));
+ if (commonstate)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.commonstate = commonstate;
+
+ /* finish the running query before sending our command */
+ vacate_connection(&tmpstate, true);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3317,11 +3607,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -3384,50 +3674,119 @@ create_cursor(ForeignScanState *node)
}
/*
- * Fetch some more rows from the node's cursor.
+ * Send the next request for the node. If the given node is not the current
+ * connection leader, push the current leader onto the waiter queue and make
+ * the given node the new leader.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* the connection must not be busy */
+ Assert(!fsstate->s.commonstate->busy);
+ /* the node must not have hit EOF */
+ Assert(!fsstate->eof_reached);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.commonstate->busy = true;
+
+ /* The node is the current leader, just return. */
+ if (leader == node)
+ return;
+
+ /* Let the node be the leader */
+ if (leader != NULL)
+ {
+ remove_async_node(node);
+ fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+ fsstate->waiter = leader;
+ }
+ else
+ {
+ fsstate->last_waiter = node;
+ fsstate->waiter = NULL;
+ }
+
+ fsstate->s.commonstate->leader = node;
+}
+
+/*
+ * Fetch the received data and automatically send the next waiter's request.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ ForeignScanState *waiter;
+
+ /* I should be the current connection leader */
+ Assert(fsstate->s.commonstate->leader == node);
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* Some tuples remain. Move them to the beginning of the array */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
- char sql[64];
- int numrows;
+ PGconn *conn = fsstate->s.conn;
+ int addrows;
+ size_t newsize;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
-
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
+ res = pgfdw_get_result(conn, fsstate->query);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3437,22 +3796,73 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
}
PG_FINALLY();
{
+ fsstate->s.commonstate->busy = false;
+
if (res)
PQclear(res);
}
PG_END_TRY();
+ /* let the first waiter be the next leader of this connection */
+ waiter = move_to_next_waiter(node);
+
+ /* send the next request if any */
+ if (waiter)
+ request_more_data(waiter);
+
MemoryContextSwitchTo(oldcontext);
}
+/*
+ * Vacate the underlying connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+ PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+ ForeignScanState *leader;
+
+ Assert(commonstate != NULL);
+
+ /* just return if the connection is already available */
+ if (commonstate->leader == NULL || !commonstate->busy)
+ return;
+
+ /*
+ * Let the current connection leader read all of the results of the
+ * running query.
+ */
+ leader = commonstate->leader;
+ fetch_received_data(leader);
+
+ /* let the first waiter be the next leader of this connection */
+ move_to_next_waiter(leader);
+
+ if (!clear_queue)
+ return;
+
+ /* Clear the waiting list */
+ while (leader)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+ fsstate->last_waiter = NULL;
+ leader = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
@@ -3566,7 +3976,9 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -3653,6 +4065,9 @@ execute_foreign_modify(EState *estate,
operation == CMD_UPDATE ||
operation == CMD_DELETE);
+ /* finish the running query before sending our command */
+ vacate_connection((PgFdwState *)fmstate, true);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -3680,14 +4095,14 @@ execute_foreign_modify(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3695,10 +4110,10 @@ execute_foreign_modify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -3734,7 +4149,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3744,12 +4159,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3757,9 +4172,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3888,16 +4303,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -4056,9 +4471,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -4066,10 +4481,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -5560,6 +5975,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+
+/*
+ * Configure the wait event.
+ *
+ * Add the wait event that the ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* Reinit is not supported for now. */
+ Assert(reinit);
+
+ if (fsstate->s.commonstate->leader == node)
+ {
+ AddWaitEventToSet(wes,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, caller_data);
+ return true;
+ }
+
+ return false;
+}
+
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation, for use while EXPLAINing ForeignScan. It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..359208a12a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1780,25 +1780,25 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1840,12 +1840,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1904,8 +1904,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
-- Test that UPDATE/DELETE with inherited target works with row-level triggers
CREATE TRIGGER trig_row_before
--
2.18.4
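The leader/waiter handoff that `request_more_data()` and `fetch_received_data()` implement in the patch above can be sketched as a standalone model. This is illustrative only: the struct and field names are hypothetical stand-ins, the actual patch also tracks the async event queue and libpq state, and "busy" here merely simulates an in-flight FETCH.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of the per-connection state in the 0003
 * patch: one "leader" scan node has a FETCH in flight on the shared
 * connection; other nodes wait in a singly linked queue behind it. */
typedef struct ScanNode
{
    struct ScanNode *waiter;        /* next node waiting on this connection */
    struct ScanNode *last_waiter;   /* queue tail (tracked on the leader) */
} ScanNode;

typedef struct ConnState
{
    ScanNode *leader;               /* node whose query is in flight */
    int       busy;                 /* nonzero while a FETCH is running */
} ConnState;

/* Send the node's next FETCH; the previous leader (if any) is queued
 * behind the new leader, mirroring request_more_data() in the patch. */
static void
request_more_data(ConnState *conn, ScanNode *node)
{
    ScanNode *leader = conn->leader;

    assert(!conn->busy);            /* connection must be idle */
    conn->busy = 1;                 /* stands in for PQsendQuery("FETCH ...") */

    if (leader == node)
        return;                     /* already the leader */

    if (leader != NULL)
    {
        node->last_waiter = leader->last_waiter;
        node->waiter = leader;
    }
    else
    {
        node->last_waiter = node;
        node->waiter = NULL;
    }
    conn->leader = node;
}

/* Consume the in-flight result and hand leadership to the first waiter,
 * mirroring the tail of fetch_received_data(). Returns the new leader. */
static ScanNode *
fetch_received_data(ConnState *conn)
{
    ScanNode *leader = conn->leader;
    ScanNode *next = leader->waiter;

    conn->busy = 0;                 /* result has been read */
    leader->waiter = NULL;
    if (next != NULL)
        next->last_waiter = leader->last_waiter;
    conn->leader = next;            /* NULL when the queue is empty */
    return next;
}
```

The key invariant, visible in both the sketch and the patch, is that only the leader ever has a query outstanding; `vacate_connection()` drains the leader before anyone else may use the connection.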
Horiguchi-san,
On Thu, Jul 2, 2020 at 11:14 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
As the result of a discussion with Fujita-san off-list, I'm going to
hold off development until he decides whether mine or Thomas' is
better.
I'd like to join the party, but IIUC, we haven't yet reached a consensus
on which one is the right way to go. So I think we need to discuss
that first.
However, I fixed two misbehaviors and rebased.
Thank you for the updated patch!
Best regards,
Etsuro Fujita
On Thu, Jul 2, 2020 at 3:20 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Thu, Jul 2, 2020 at 11:14 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
As the result of a discussion with Fujita-san off-list, I'm going to
hold off development until he decides whether mine or Thomas' is
better.
I'd like to join the party, but IIUC, we haven't yet reached a consensus
on which one is the right way to go. So I think we need to discuss
that first.
Either way, we definitely need patch 0001. One comment:
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
I wonder if it's better to have it receive ResourceOwner like that, or
to have it capture CurrentResourceOwner. I think the latter is more
common in existing code.
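The two constructor styles being weighed here can be contrasted with a toy model. The types and names below are hypothetical stand-ins, not PostgreSQL's latch or resowner API; the point is only the API-shape trade-off.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model of the two constructor styles under discussion. */
typedef struct Owner { int nsets; } Owner;
typedef struct EventSet { Owner *owner; } EventSet;

static Owner *CurrentOwner = NULL;  /* analogue of CurrentResourceOwner */

/* Style 1 (the patch): the caller passes the owner explicitly, and NULL
 * means "unowned", which long-lived sets like FeBeWaitSet rely on. */
static EventSet *
create_set_explicit(Owner *res)
{
    EventSet *set = malloc(sizeof(EventSet));

    set->owner = res;
    if (res != NULL)
        res->nsets++;               /* "remember" the set in the owner */
    return set;
}

/* Style 2 (the suggestion): capture the ambient owner unconditionally.
 * Opting out would then need an extra flag, as noted later in the thread. */
static EventSet *
create_set_implicit(void)
{
    return create_set_explicit(CurrentOwner);
}
```

With style 2, every caller that wants an unowned set must either clear the ambient owner around the call or pass a flag, which is the complexity the explicit parameter avoids.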
On Fri, Aug 14, 2020 at 10:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Jul 2, 2020 at 3:20 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I'd like to join the party, but IIUC, we haven't yet reached a consensus
on which one is the right way to go. So I think we need to discuss
that first.
Either way, we definitely need patch 0001. One comment:
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
I wonder if it's better to have it receive ResourceOwner like that, or
to have it capture CurrentResourceOwner. I think the latter is more
common in existing code.
Sorry for not having discussed anything, but actually, I’ve started
reviewing your patch first. I’ll return to this after reviewing it
some more.
Thanks!
Best regards,
Etsuro Fujita
On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote:
As the result of a discussion with Fujita-san off-list, I'm going to
hold off development until he decides whether mine or Thomas' is
better.
The latest patch doesn't apply, so I set it as WoA.
https://commitfest.postgresql.org/29/2491/
--
Justin
At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in
On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote:
As the result of a discussion with Fujita-san off-list, I'm going to
hold off development until he decides whether mine or Thomas' is
better.
The latest patch doesn't apply, so I set it as WoA.
https://commitfest.postgresql.org/29/2491/
Thanks. This is the rebased version.
At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
Either way, we definitely need patch 0001. One comment:
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
I wonder if it's better to have it receive ResourceOwner like that, or
to have it capture CurrentResourceOwner. I think the latter is more
common in existing code.
There are no existing WaitEventSets belonging to a resowner, so
unconditionally capturing CurrentResourceOwner doesn't work well. I
could pass a bool instead, but that would make things more complex.
Come to think of "complex", the ExecAsync stuff in this patch might be
too much for a short-term solution if an executor overhaul comes
shortly. (The patch of mine here as a whole is like that, though..)
The queueing stuff in postgres_fdw is, too.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v6-0001-Allow-wait-event-set-to-be-registered-to-resource.patch (text/x-patch)
From 18176c9caa856c707ef6e8ab64bfc7f8abd9aea6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v6 1/3] Allow wait event set to be registered to resource
owner
WaitEventSet needs to be released via the resource owner mechanism in
certain cases. This change adds a resowner field to WaitEventSet and
allows the creator of a WaitEventSet to specify a resource owner.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/postmaster/pgstat.c | 2 +-
src/backend/postmaster/syslogger.c | 2 +-
src/backend/storage/ipc/latch.c | 20 ++++++--
src/backend/utils/resowner/resowner.c | 67 +++++++++++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 ++++
7 files changed, 98 insertions(+), 7 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index ac986c0505..799fa5006d 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944fb1..9d6b3778b4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4488,7 +4488,7 @@ PgstatCollectorMain(int argc, char *argv[])
pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
/* Prepare to wait for our latch or data in our socket. */
- wes = CreateWaitEventSet(CurrentMemoryContext, 3);
+ wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index ffcb54968f..a4de6d90e2 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -300,7 +300,7 @@ SysLoggerMain(int argc, char *argv[])
* syslog pipe, which implies that all other backends have exited
* (including the postmaster).
*/
- wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+ wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
#ifndef WIN32
AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4153cc8557..e771ac9610 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -57,6 +57,7 @@
#include "storage/pmsignal.h"
#include "storage/shmem.h"
#include "utils/memutils.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -85,6 +86,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -257,7 +260,7 @@ InitializeLatchWaitSet(void)
Assert(LatchWaitSet == NULL);
/* Set up the WaitEventSet used by WaitLatch(). */
- LatchWaitSet = CreateWaitEventSet(TopMemoryContext, 2);
+ LatchWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 2);
latch_pos = AddWaitEventToSet(LatchWaitSet, WL_LATCH_SET, PGINVALID_SOCKET,
MyLatch, NULL);
if (IsUnderPostmaster)
@@ -441,7 +444,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -608,12 +611,15 @@ ResetLatch(Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -728,6 +734,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -773,6 +784,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..237ca9fa30 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
ResourceArray jitarr; /* JIT contexts */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
jit_release_context(context);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -725,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
Assert(owner->jitarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -752,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
ResourceArrayFree(&(owner->jitarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1370,3 +1385,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
elog(ERROR, "JIT context %p is not owned by resource owner %s",
DatumGetPointer(handle), owner->name);
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * so use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * so use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 7c742021fb..ae13d4c08d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
extern void SetLatch(Latch *latch);
extern void ResetLatch(Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
extern void ResourceOwnerForgetJIT(ResourceOwner owner,
Datum handle);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.18.4
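The enlarge/remember/forget/release discipline the 0001 patch follows for wait event sets can be sketched standalone. This is an illustrative model only: `MAX_RES`, `ResOwner`, and the function names below are stand-ins, and the real code grows a `ResourceArray` and raises `ERROR`/`WARNING` through elog.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_RES 8               /* toy fixed capacity; real arrays grow */

typedef struct ResOwner
{
    void *items[MAX_RES];
    int   nitems;
} ResOwner;

/* "Enlarge" first so the later remember cannot fail after the resource
 * has already been acquired (the point of ResourceOwnerEnlargeWESs). */
static int
owner_enlarge(ResOwner *o)
{
    return o->nitems < MAX_RES;
}

/* Remember a resource; caller must have called owner_enlarge() first. */
static void
owner_remember(ResOwner *o, void *res)
{
    o->items[o->nitems++] = res;
}

/* Forget a resource on explicit free; returns 0 if it wasn't remembered
 * (the real code raises ERROR in that case). */
static int
owner_forget(ResOwner *o, void *res)
{
    for (int i = 0; i < o->nitems; i++)
        if (o->items[i] == res)
        {
            o->items[i] = o->items[--o->nitems];
            return 1;
        }
    return 0;
}

/* At release, anything still remembered is a leak: warn on commit, then
 * free it (stands in for PrintWESLeakWarning + FreeWaitEventSet). */
static int
owner_release(ResOwner *o, int is_commit)
{
    int leaked = o->nitems;

    while (o->nitems > 0)
    {
        void *res = o->items[o->nitems - 1];

        if (is_commit)
            fprintf(stderr, "wait event set leak: %p still referenced\n", res);
        owner_forget(o, res);
    }
    return leaked;
}
```

Note how `FreeWaitEventSet()` in the patch calls the forget step itself, so a set freed normally never reaches the release-time warning; only sets that survive to commit are reported as leaks.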
v6-0002-Infrastructure-for-asynchronous-execution.patch (text/x-patch)
From 87edb960381302fb487e5200d3cd228eb8a9b413 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v6 2/3] Infrastructure for asynchronous execution
This patch adds infrastructure for asynchronous execution. As a PoC,
it makes only Append capable of handling asynchronously executable
subnodes.
---
src/backend/commands/explain.c | 17 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execAsync.c | 152 +++++++++++
src/backend/executor/nodeAppend.c | 342 ++++++++++++++++++++----
src/backend/executor/nodeForeignscan.c | 21 ++
src/backend/nodes/bitmapset.c | 72 +++++
src/backend/nodes/copyfuncs.c | 3 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 3 +
src/backend/optimizer/path/allpaths.c | 24 ++
src/backend/optimizer/path/costsize.c | 55 +++-
src/backend/optimizer/plan/createplan.c | 45 +++-
src/backend/postmaster/pgstat.c | 3 +
src/backend/utils/adt/ruleutils.c | 8 +-
src/backend/utils/resowner/resowner.c | 4 +-
src/include/executor/execAsync.h | 22 ++
src/include/executor/executor.h | 1 +
src/include/executor/nodeForeignscan.h | 3 +
src/include/foreign/fdwapi.h | 11 +
src/include/nodes/bitmapset.h | 1 +
src/include/nodes/execnodes.h | 23 +-
src/include/nodes/plannodes.h | 9 +
src/include/optimizer/paths.h | 2 +
src/include/pgstat.h | 3 +-
24 files changed, 756 insertions(+), 72 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 30e0a7ee7f..07001da4a3 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -86,6 +86,7 @@ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
ExplainState *es);
static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1389,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1970,6 +1973,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Hash:
show_hash_info(castNode(HashState, planstate), es);
break;
+
+ case T_Append:
+ show_append_info(castNode(AppendState, planstate), es);
+ break;
+
default:
break;
}
@@ -2323,6 +2331,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ancestors, es);
}
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+ Append *plan = (Append *) astate->ps.plan;
+
+ if (plan->nasyncplans > 0)
+ ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
/*
* Show the grouping keys for an Agg node.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller did not reuse an existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit)
+{
+ switch (nodeTag(node))
+ {
+ case T_ForeignScanState:
+ return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+ wes, data, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(node));
+ }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+ int **p_refind;
+ int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+ /* arg is the address of the variable refind in ExecAsyncEventWait */
+ ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+ *mcbarg->p_refind = NULL;
+ *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
+ WaitEventSet *wes;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred = 0;
+ Bitmapset *fired_events = NULL;
+ int i;
+ int n;
+
+ n = bms_num_members(waitnodes);
+ wes = CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner, n);
+ if (refindsize < n)
+ {
+ if (refindsize == 0)
+ refindsize = EVENT_BUFFER_SIZE; /* XXX */
+ while (refindsize < n)
+ refindsize *= 2;
+ if (refind)
+ refind = (int *) repalloc(refind, refindsize * sizeof(int));
+ else
+ {
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+ MemoryContext oldctxt =
+ MemoryContextSwitchTo(TopTransactionContext);
+
+ /*
+ * refind points to a memory block in
+ * TopTransactionContext. Register a callback to reset it.
+ */
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+ refind = (int *) palloc(refindsize * sizeof(int));
+ MemoryContextSwitchTo(oldctxt);
+ }
+ }
+
+ /* Prepare WaitEventSet for waiting on the waitnodes. */
+ n = 0;
+ for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+ i = bms_next_member(waitnodes, i))
+ {
+ refind[i] = i;
+ if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+ n++;
+ }
+
+ /* Return immediately if there are no nodes to wait on. */
+ if (n == 0)
+ {
+ FreeWaitEventSet(wes);
+ return NULL;
+ }
+
+ noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+ EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+ FreeWaitEventSet(wes);
+ if (noccurred == 0)
+ return NULL;
+
+ for (i = 0 ; i < noccurred ; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+ {
+ int n = *(int*)w->user_data;
+
+ fired_events = bms_add_member(fired_events, n);
+ }
+ }
+
+ return fired_events;
+}
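The refind buffer sizing in ExecAsyncEventWait above can be checked in isolation. This is a standalone sketch: it starts at EVENT_BUFFER_SIZE and doubles until the buffer can hold n entries; the use of malloc/realloc is illustrative, since the patch itself allocates in TopTransactionContext:

```c
#include <assert.h>
#include <stdlib.h>

#define EVENT_BUFFER_SIZE 16

/* Grow an int buffer to hold at least n entries, doubling from a
 * starting size of EVENT_BUFFER_SIZE, mirroring the refind sizing
 * logic in ExecAsyncEventWait. */
static int *
grow_refind(int *buf, int *size, int n)
{
    if (*size >= n)
        return buf;             /* already large enough */
    if (*size == 0)
        *size = EVENT_BUFFER_SIZE;
    while (*size < n)
        *size *= 2;
    return realloc(buf, *size * sizeof(int));
}
```

Doubling keeps the number of reallocations logarithmic in the largest wait-set size seen during the transaction.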
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..60c36ee048 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
#include "miscadmin.h"
/* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
#define INVALID_SUBPLAN_INDEX -1
static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
PlanState **appendplanstates;
Bitmapset *validsubplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
/* check for unsupported flags */
- Assert(!(eflags & EXEC_FLAG_MARK));
+ Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
/*
* create new AppendState for our append node
*/
appendstate->ps.plan = (Plan *) node;
appendstate->ps.state = estate;
- appendstate->ps.ExecProcNode = ExecAppend;
/* Let choose_next_subplan_* function handle setting the first subplan */
- appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/*
* When no run-time pruning is required and there's at least one
- * subplan, we can fill as_valid_subplans immediately, preventing
+ * subplan, we can fill as_valid_syncsubplans immediately, preventing
* later calls to ExecFindMatchingSubPlans.
*/
if (!prunestate->do_exec_prune && nplans > 0)
- appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
}
else
{
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* subplans as valid; they must also all be initialized.
*/
Assert(nplans > 0);
- appendstate->as_valid_subplans = validsubplans =
- bms_add_range(NULL, 0, nplans - 1);
+ validsubplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
appendstate->as_prune_state = NULL;
}
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
*/
j = 0;
firstvalid = nplans;
+ nasyncplans = 0;
+
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ int sub_eflags = eflags;
+
+ /* Let async-capable subplans run asynchronously */
+ if (i < node->nasyncplans)
+ {
+ sub_eflags |= EXEC_FLAG_ASYNC;
+ nasyncplans++;
+ }
/*
* Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
if (i >= node->first_partial_plan && j < firstvalid)
firstvalid = j;
- appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+ appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
}
appendstate->as_first_partial_plan = firstvalid;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* fill in async stuff */
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_syncdone = (nasyncplans == nplans);
+ appendstate->as_exec_prune = false;
+
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+ if (appendstate->as_nasyncplans)
+ {
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async subplans need a request */
+ appendstate->as_needrequest =
+ bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+ /*
+ * ExecAppendAsync needs as_valid_syncsubplans to handle async
+ * subnodes.
+ */
+ if (appendstate->as_prune_state != NULL &&
+ appendstate->as_prune_state->do_exec_prune)
+ {
+ Assert(appendstate->as_valid_syncsubplans == NULL);
+
+ appendstate->as_exec_prune = true;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
- if (node->as_whichplan < 0)
+ if (node->as_whichsyncplan < 0)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
* If no subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+ if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
!node->choose_next_subplan(node))
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
+ Assert(node->as_nasyncplans == 0);
+
for (;;)
{
PlanState *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
/*
* figure out which subplan we are currently processing
*/
- Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
- subnode = node->appendplans[node->as_whichplan];
+ Assert(node->as_whichsyncplan >= 0 &&
+ node->as_whichsyncplan < node->as_nplans);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
}
}
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+ AppendState *node = castNode(AppendState, pstate);
+ Bitmapset *needrequest;
+ int i;
+
+ Assert(node->as_nasyncplans > 0);
+
+restart:
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (node->as_exec_prune)
+ {
+ Bitmapset *valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ /* Distribute valid subplans into sync and async */
+ node->as_needrequest =
+ bms_intersect(node->as_needrequest, valid_subplans);
+ node->as_valid_syncsubplans =
+ bms_difference(valid_subplans, node->as_needrequest);
+
+ node->as_exec_prune = false;
+ }
+
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ while ((i = bms_first_member(needrequest)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
+ {
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ }
+ }
+ else
+ node->as_pending_async = bms_add_member(node->as_pending_async, i);
+ }
+ bms_free(needrequest);
+
+ for (;;)
+ {
+ TupleTableSlot *result;
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ while (!bms_is_empty(node->as_pending_async))
+ {
+ /* Don't wait for async nodes if any sync node exists. */
+ long timeout = node->as_syncdone ? -1 : 0;
+ Bitmapset *fired;
+ int i;
+
+ fired = ExecAsyncEventWait(node->appendplans,
+ node->as_pending_async,
+ timeout);
+
+ if (bms_is_empty(fired) && node->as_syncdone)
+ {
+ /*
+ * We come here when all the subnodes had fired before
+ * waiting. Retry fetching from the nodes.
+ */
+ node->as_needrequest = node->as_pending_async;
+ node->as_pending_async = NULL;
+ goto restart;
+ }
+
+ while ((i = bms_first_member(fired)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+ slot = ExecProcNode(subnode);
+
+ Assert(subnode->asyncstate == AS_AVAILABLE);
+
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, i);
+ }
+
+ node->as_pending_async =
+ bms_del_member(node->as_pending_async, i);
+ }
+ bms_free(fired);
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done scanning
+ * this node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the synchronous children.
+ */
+
+ if (!node->as_syncdone &&
+ node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+ node->as_syncdone = !node->choose_next_subplan(node);
+
+ if (node->as_syncdone)
+ {
+ Assert(bms_is_empty(node->as_pending_async));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
+ /*
+ * get a tuple from the subplan
+ */
+ result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+ if (!TupIsNull(result))
+ {
+ /*
+ * If the subplan gave us something then return it as-is. We do
+ * NOT make use of the result slot that was set up in
+ * ExecInitAppend; there's no need for it.
+ */
+ return result;
+ }
+
+ /*
+ * Go on to the "next" subplan. If no more subplans, return the empty
+ * slot set up for us by ExecInitAppend, unless there are async plans
+ * we have yet to finish.
+ */
+ if (!node->choose_next_subplan(node))
+ {
+ node->as_syncdone = true;
+ if (bms_is_empty(node->as_pending_async))
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /* Else loop back and try to get a tuple from the new subplan */
+ }
+}
+
/* ----------------------------------------------------------------
* ExecEndAppend
*
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
bms_overlap(node->ps.chgParam,
node->as_prune_state->execparamids))
{
- bms_free(node->as_valid_subplans);
- node->as_valid_subplans = NULL;
+ bms_free(node->as_valid_syncsubplans);
+ node->as_valid_syncsubplans = NULL;
}
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ ExecShutdownNode(node->appendplans[i]);
+
+ node->as_nasyncresult = 0;
+ node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
}
/* Let choose_next_subplan_* function handle setting the first subplan */
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
}
/* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
static bool
choose_next_subplan_locally(AppendState *node)
{
- int whichplan = node->as_whichplan;
+ int whichplan = node->as_whichsyncplan;
int nextplan;
/* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
{
- if (node->as_valid_subplans == NULL)
- node->as_valid_subplans =
+ /* Shouldn't have an active async node */
+ Assert(bms_is_empty(node->as_needrequest));
+
+ if (node->as_valid_syncsubplans == NULL)
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
+ /* Exclude async plans */
+ if (node->as_nasyncplans > 0)
+ bms_del_range(node->as_valid_syncsubplans,
+ 0, node->as_nasyncplans - 1);
+
whichplan = -1;
}
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
Assert(whichplan >= -1 && whichplan <= node->as_nplans);
if (ScanDirectionIsForward(node->ps.state->es_direction))
- nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
else
- nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
if (nextplan < 0)
return false;
- node->as_whichplan = nextplan;
+ node->as_whichsyncplan = nextplan;
return true;
}
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
{
/* Mark just-completed subplan as finished. */
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
}
else
{
/* Start with last subplan. */
- node->as_whichplan = node->as_nplans - 1;
+ node->as_whichsyncplan = node->as_nplans - 1;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be
* set to all subplans.
*/
- if (node->as_valid_subplans == NULL)
+ if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
/*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
}
/* Loop until we find a subplan to execute. */
- while (pstate->pa_finished[node->as_whichplan])
+ while (pstate->pa_finished[node->as_whichsyncplan])
{
- if (node->as_whichplan == 0)
+ if (node->as_whichsyncplan == 0)
{
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
LWLockRelease(&pstate->pa_lock);
return false;
}
/*
- * We needn't pay attention to as_valid_subplans here as all invalid
+ * We needn't pay attention to as_valid_syncsubplans here as all invalid
* plans have been marked as finished.
*/
- node->as_whichplan--;
+ node->as_whichsyncplan--;
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
/* Mark just-completed subplan as finished. */
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be set
* to all subplans.
*/
- else if (node->as_valid_subplans == NULL)
+ else if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
mark_invalid_subplans_as_finished(node);
}
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Save the plan from which we are starting the search. */
- node->as_whichplan = pstate->pa_next_plan;
+ node->as_whichsyncplan = pstate->pa_next_plan;
/* Loop until we find a valid subplan to execute. */
while (pstate->pa_finished[pstate->pa_next_plan])
{
int nextplan;
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
if (nextplan >= 0)
{
/* Advance to the next valid plan. */
pstate->pa_next_plan = nextplan;
}
- else if (node->as_whichplan > node->as_first_partial_plan)
+ else if (node->as_whichsyncplan > node->as_first_partial_plan)
{
/*
* Try looping back to the first valid partial plan, if there is
* one. If there isn't, arrange to bail out below.
*/
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
pstate->pa_next_plan =
- nextplan < 0 ? node->as_whichplan : nextplan;
+ nextplan < 0 ? node->as_whichsyncplan : nextplan;
}
else
{
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
* At last plan, and either there are no partial plans or we've
* tried them all. Arrange to bail out.
*/
- pstate->pa_next_plan = node->as_whichplan;
+ pstate->pa_next_plan = node->as_whichsyncplan;
}
- if (pstate->pa_next_plan == node->as_whichplan)
+ if (pstate->pa_next_plan == node->as_whichsyncplan)
{
/* We've tried everything! */
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Pick the plan we found, and advance pa_next_plan one more time. */
- node->as_whichplan = pstate->pa_next_plan;
- pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+ node->as_whichsyncplan = pstate->pa_next_plan;
+ pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
/*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
*/
if (pstate->pa_next_plan < 0)
{
- int nextplan = bms_next_member(node->as_valid_subplans,
+ int nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
Assert(node->as_prune_state);
/* Nothing to do if all plans are valid */
- if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+ if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
return;
/* Mark all non-valid plans as finished */
for (i = 0; i < node->as_nplans; i++)
{
- if (!bms_is_member(i, node->as_valid_subplans))
+ if (!bms_is_member(i, node->as_valid_syncsubplans))
node->as_pstate->pa_finished[i] = true;
}
}
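The request-handling loop in ExecAppendAsync can be reduced to a toy model: each async subplan in the "need request" set is polled; if it yields a tuple it is re-added to the set, and an exhausted subplan is dropped. All names and the bitmask representation here are illustrative, not PostgreSQL APIs:

```c
#include <assert.h>

#define NPLANS 3

/* tuples left per toy subplan */
static int remaining[NPLANS] = {2, 1, 3};

/* Poll every subplan in *needrequest once.  A subplan that returns a
 * tuple is re-added for the next round; an exhausted one is dropped.
 * Returns nonzero while any subplan still needs a request. */
static int
fetch_round(unsigned *needrequest, int *ntuples)
{
    unsigned    req = *needrequest;
    *needrequest = 0;

    for (int i = 0; i < NPLANS; i++)
    {
        if (!(req & (1u << i)))
            continue;
        if (remaining[i] > 0)           /* subplan returned a tuple */
        {
            remaining[i]--;
            (*ntuples)++;
            *needrequest |= 1u << i;    /* ask it again next round */
        }
        /* else: subplan exhausted, leave it out of the set */
    }
    return *needrequest != 0;
}
```

Driving the loop until the set is empty collects every tuple exactly once, which is the invariant the real node maintains across its needrequest/pending bitmapsets.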
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+ scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+ if ((eflags & EXEC_FLAG_ASYNC) != 0)
+ scanstate->fs_async = true;
/*
* Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecForeignAsyncConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+ caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
return a;
}
+/*
+ * bms_del_range
+ * Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+ int lwordnum,
+ lbitnum,
+ uwordnum,
+ ushiftbits,
+ wordnum;
+
+ if (lower < 0 || upper < 0)
+ elog(ERROR, "negative bitmapset member not allowed");
+ if (lower > upper)
+ elog(ERROR, "lower range must not be above upper range");
+ uwordnum = WORDNUM(upper);
+
+ if (a == NULL)
+ {
+ a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ }
+
+ /* ensure we have enough words to store the upper bit */
+ else if (uwordnum >= a->nwords)
+ {
+ int oldnwords = a->nwords;
+ int i;
+
+ a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ /* zero out the enlarged portion */
+ for (i = oldnwords; i < a->nwords; i++)
+ a->words[i] = 0;
+ }
+
+ wordnum = lwordnum = WORDNUM(lower);
+
+ lbitnum = BITNUM(lower);
+ ushiftbits = BITNUM(upper) + 1;
+
+ /*
+ * As a special case, when lwordnum is the same as uwordnum we must
+ * perform both the upper and lower masking on the same word.
+ */
+ if (lwordnum == uwordnum)
+ {
+ a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+ | (~(bitmapword) 0) << ushiftbits);
+ }
+ else
+ {
+ /* turn off lbitnum and all bits left of it */
+ a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+ /* turn off all bits for any intermediate words */
+ while (wordnum < uwordnum)
+ a->words[wordnum++] = (bitmapword) 0;
+
+ /* turn off upper's bit and all bits right of it. */
+ a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+ }
+
+ return a;
+}
+
/*
* bms_int_members - like bms_intersect, but left input is recycled
*/
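The masking in bms_del_range can be verified in isolation. This sketch transcribes the same-word case (lwordnum == uwordnum) onto a plain uint64_t; it assumes 0 <= lower <= upper < 63 so that neither shift is undefined:

```c
#include <assert.h>
#include <stdint.h>

/* Same-word case of bms_del_range on a 64-bit word: keep the bits
 * below 'lower' and the bits above 'upper', clearing the range in
 * between.  Assumes 0 <= lower <= upper < 63. */
static uint64_t
del_range_word(uint64_t w, int lower, int upper)
{
    uint64_t    keep_low = (UINT64_C(1) << lower) - 1;      /* bits 0..lower-1 */
    uint64_t    keep_high = ~UINT64_C(0) << (upper + 1);    /* bits upper+1..63 */

    return w & (keep_low | keep_high);
}
```

Working a word at a time like this is what makes the range form faster than a bms_del_member loop when the range spans many bits.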
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 89c409de66..db0234b17a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..d4bb44b268 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..63af7c02d8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1572,6 +1572,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1672,6 +1673,8 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6da0dcd61c..055c8a9fb0 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3954,6 +3954,30 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
list_free(live_children);
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
+
/*****************************************************************************
* DEBUG SUPPORT
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index fda4b2c6e8..da59c48091 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2048,22 +2048,59 @@ cost_append(AppendPath *apath)
if (pathkeys == NIL)
{
- Path *subpath = (Path *) linitial(apath->subpaths);
-
- /*
- * For an unordered, non-parallel-aware Append we take the startup
- * cost as the startup cost of the first subpath.
- */
- apath->path.startup_cost = subpath->startup_cost;
+ Cost first_nonasync_startup_cost = -1.0;
+ Cost async_min_startup_cost = -1.0;
+ Cost async_max_cost = 0.0;
/* Compute rows and costs as sums of subplan rows and costs. */
foreach(l, apath->subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ /*
+ * For an unordered, non-parallel-aware Append we take the
+ * startup cost as the startup cost of the first
+ * nonasync-capable subpath or the minimum startup cost of
+ * async-capable subpaths.
+ */
+ if (!is_async_capable_path(subpath))
+ {
+ if (first_nonasync_startup_cost < 0.0)
+ first_nonasync_startup_cost = subpath->startup_cost;
+
+ apath->path.total_cost += subpath->total_cost;
+ }
+ else
+ {
+ if (async_min_startup_cost < 0.0 ||
+ async_min_startup_cost > subpath->startup_cost)
+ async_min_startup_cost = subpath->startup_cost;
+
+ /*
+ * It's not obvious how to determine the total cost of
+ * async subnodes. Since they run concurrently, we assume
+ * it is the maximum total cost among all async subnodes,
+ * though that is not always accurate.
+ */
+ if (async_max_cost < subpath->total_cost)
+ async_max_cost = subpath->total_cost;
+ }
+
apath->path.rows += subpath->rows;
- apath->path.total_cost += subpath->total_cost;
}
+
+ /*
+ * If there are any sync subnodes, the startup cost is the startup
+ * cost of the first sync subnode. Otherwise it's the minimum
+ * startup cost among the async subnodes.
+ */
+ if (first_nonasync_startup_cost >= 0.0)
+ apath->path.startup_cost = first_nonasync_startup_cost;
+ else
+ apath->path.startup_cost = async_min_startup_cost;
+
+ /* Use async maximum cost if it exceeds the sync total cost */
+ if (async_max_cost > apath->path.total_cost)
+ apath->path.total_cost = async_max_cost;
}
else
{
@@ -2084,6 +2121,8 @@ cost_append(AppendPath *apath)
* This case is also different from the above in that we have to
* account for possibly injecting sorts into subpaths that aren't
* natively ordered.
+ *
+ * Note: An ordered append won't be run asynchronously.
*/
foreach(l, apath->subpaths)
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 99278eed93..b38cb5e4ca 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1082,6 +1082,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
bool tlist_was_changed = false;
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
+ List *asyncpaths = NIL;
+ List *syncpaths = NIL;
+ List *newsubpaths = NIL;
ListCell *subpaths;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
@@ -1090,6 +1095,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1219,9 +1227,40 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
}
- subplans = lappend(subplans, subplan);
+ /*
+ * Classify the subplan as async-capable or not. If we have decided
+ * to run the children in parallel, none of them can run
+ * asynchronously. Likewise, the planner assumes all subnodes are
+ * executed in order if this Append is ordered, so no subpath can run
+ * asynchronously in that case.
+ */
+ if (pathkeys == NIL &&
+ !best_path->path.parallel_safe && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ asyncplans = lappend(asyncplans, subplan);
+ asyncpaths = lappend(asyncpaths, subpath);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ {
+ syncplans = lappend(syncplans, subplan);
+ syncpaths = lappend(syncpaths, subpath);
+ }
+
+ first = false;
}
+ /*
+ * subplans contains the async plans first, if any, followed by the
+ * sync plans, if any. The subpaths list must be ordered the same way
+ * so that the partition pruning information stays in sync with
+ * subplans.
+ */
+ subplans = list_concat(asyncplans, syncplans);
+ newsubpaths = list_concat(asyncpaths, syncpaths);
+
/*
* If any quals exist, they may be useful to perform further partition
* pruning during execution. Gather information needed by the executor to
@@ -1249,7 +1288,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
if (prunequal != NIL)
partpruneinfo =
make_partition_pruneinfo(root, rel,
- best_path->subpaths,
+ newsubpaths,
best_path->partitioned_rels,
prunequal);
}
@@ -1257,6 +1296,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
plan->appendplans = subplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
+ plan->nasyncplans = nasyncplans;
+ plan->referent = referent_is_sync ? nasyncplans : 0;
copy_generic_path_info(&plan->plan, (Path *) best_path);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9d6b3778b4..1765c56545 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3878,6 +3878,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 60dd80c23c..5680c58739 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4574,10 +4574,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
* tlists according to one of the children, and the first one is the most
* natural choice. Likewise special-case ModifyTable to pretend that the
* first child plan is the OUTER referent; this is to support RETURNING
- * lists containing references to non-target relations.
+ * lists containing references to non-target relations. For Append, use the
+ * explicitly specified referent.
*/
if (IsA(plan, Append))
- dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+ {
+ Append *app = (Append *) plan;
+ dpns->outer_plan = list_nth(app->appendplans, app->referent);
+ }
else if (IsA(plan, MergeAppend))
dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
else if (IsA(plan, ModifyTable))
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 237ca9fa30..27742a1641 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1416,7 +1416,7 @@ void
ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
{
/*
- * XXXX: There's no property to show as an identier of a wait event set,
+ * XXXX: There's no property to show as an identifier of a wait event set,
* use its pointer instead.
*/
if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
@@ -1431,7 +1431,7 @@ static void
PrintWESLeakWarning(WaitEventSet *events)
{
/*
- * XXXX: There's no property to show as an identier of a wait event set,
+ * XXXX: There's no property to show as an identifier of a wait event set,
* use its pointer instead.
*/
elog(WARNING, "wait event set leak: %p still referenced",
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+ long timeout);
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 415e117407..9cf2c1f676 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
#define EXEC_FLAG_MARK 0x0008 /* need mark/restore */
#define EXEC_FLAG_SKIP_TRIGGERS 0x0010 /* skip AfterTrigger calls */
#define EXEC_FLAG_WITH_NO_DATA 0x0020 /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC 0x0040 /* request async execution */
/* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data,
+ bool reinit);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
GetForeignPlan_function GetForeignPlan;
BeginForeignScan_function BeginForeignScan;
IterateForeignScan_function IterateForeignScan;
+ IterateForeignScan_function IterateForeignScanAsync;
ReScanForeignScan_function ReScanForeignScan;
EndForeignScan_function EndForeignScan;
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
InitializeDSMForeignScan_function InitializeDSMForeignScan;
ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
ShutdownForeignScan_function ShutdownForeignScan;
/* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
/* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0b42dd6f94..2d47a4162e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -940,6 +940,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
* abstract superclass for all PlanState-type nodes.
* ----------------
*/
+typedef enum AsyncState
+{
+ AS_AVAILABLE,
+ AS_WAITING
+} AsyncState;
+
typedef struct PlanState
{
NodeTag type;
@@ -1028,6 +1034,11 @@ typedef struct PlanState
bool outeropsset;
bool inneropsset;
bool resultopsset;
+
+ /* Async subnode execution stuff */
+ AsyncState asyncstate;
+
+ int32 padding; /* to keep alignment of derived types */
} PlanState;
/* ----------------
@@ -1223,14 +1234,21 @@ struct AppendState
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
- int as_whichplan;
+ int as_whichsyncplan; /* which sync plan is being executed */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
+ int as_nasyncplans; /* # of async-capable children */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
- Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_syncsubplans;
bool (*choose_next_subplan) (AppendState *);
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ Bitmapset *as_pending_async; /* pending async plans */
+ TupleTableSlot **as_asyncresult; /* results of each async plan */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ bool as_exec_prune; /* runtime pruning needed for async exec? */
};
/* ----------------
@@ -1798,6 +1816,7 @@ typedef struct ForeignScanState
Size pscan_len; /* size of parallel coordination information */
/* use struct pointer to avoid including fdwapi.h here */
struct FdwRoutine *fdwroutine;
+ bool fs_async;
void *fdw_state; /* foreign-data wrapper can keep state here */
} ForeignScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..abad89b327 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous execution logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -262,6 +267,10 @@ typedef struct Append
/* Info for run-time subplan pruning; NULL if we're not doing that */
struct PartitionPruneInfo *part_prune_info;
+
+ /* Async child node execution stuff */
+ int nasyncplans; /* # async subplans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 10b6e81079..53876b2d8b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -241,4 +241,6 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+extern bool is_async_capable_path(Path *path);
+
#endif /* PATHS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..c0ea7f5aa4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,7 +887,8 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
- WAIT_EVENT_XACT_GROUP_UPDATE
+ WAIT_EVENT_XACT_GROUP_UPDATE,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.18.4
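As a side note, the new unordered-Append costing in the costsize.c hunk above can be summarized as: sync subpaths contribute the sum of their total costs, while async subpaths are assumed to overlap, so only their maximum total cost counts. Here is a rough sketch in Python (illustration only, not part of the patch; the tuple encoding of subpaths is invented for this example):

```python
def append_costs(subpaths):
    """Mimic the unordered, non-parallel-aware branch of cost_append().

    subpaths: list of (startup_cost, total_cost, is_async) tuples.
    Returns (startup_cost, total_cost) for the Append as a whole.
    """
    first_sync_startup = None   # startup cost of the first sync subpath
    async_min_startup = None    # minimum startup cost among async subpaths
    sync_total = 0.0            # sync subpaths are run one after another
    async_max_total = 0.0       # async subpaths are assumed to overlap

    for startup, total, is_async in subpaths:
        if not is_async:
            if first_sync_startup is None:
                first_sync_startup = startup
            sync_total += total
        else:
            if async_min_startup is None or startup < async_min_startup:
                async_min_startup = startup
            async_max_total = max(async_max_total, total)

    # Startup cost: the first sync subpath if there is one, otherwise
    # the cheapest-to-start async subpath.
    if first_sync_startup is not None:
        startup_cost = first_sync_startup
    else:
        startup_cost = async_min_startup if async_min_startup is not None else 0.0

    # Total cost: use the async maximum if it exceeds the sync total.
    total_cost = max(sync_total, async_max_total)
    return startup_cost, total_cost
```

For example, with two sync subpaths and two async ones, the Append's startup cost comes from the first sync subpath, and the total cost is whichever is larger: the sum of the sync totals or the most expensive async subpath.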
Attachment: v6-0003-async-postgres_fdw.patch (text/x-patch; charset=us-ascii)
From 74f8a34ee84ef45cffd5b5c51f67301d3b176cab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v6 3/3] async postgres_fdw
---
contrib/postgres_fdw/connection.c | 28 +
.../postgres_fdw/expected/postgres_fdw.out | 272 ++++----
contrib/postgres_fdw/postgres_fdw.c | 601 +++++++++++++++---
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +-
5 files changed, 710 insertions(+), 213 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 08daf26fdf..be5948f613 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -59,6 +59,7 @@ typedef struct ConnCacheEntry
bool invalidated; /* true if reconnect is pending */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -203,6 +204,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
entry->conn, server->servername, user->umid, user->userid);
+ entry->storage = NULL;
}
/*
@@ -216,6 +218,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
return entry->conn;
}
+/*
+ * Return the connection-specific storage for this user. Allocate it
+ * with the given initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ bool found;
+ ConnCacheEntry *entry;
+ ConnCacheKey key;
+
+ /* Find the storage using the same key as GetConnection */
+ key = user->umid;
+ entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+ Assert(found);
+
+ /* Create it if it does not exist yet. */
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
/*
* Connect to remote server using specified server and user mapping properties.
*/
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 90db550b92..9374fa3a6c 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6973,7 +6973,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7001,7 +7001,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7029,7 +7029,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7057,7 +7057,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7127,35 +7127,41 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7165,35 +7171,41 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7223,11 +7235,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-> Hash Join
Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
@@ -7241,12 +7254,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
select tableoid::regclass, * from bar order by 1,2;
@@ -7276,16 +7290,17 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
+ Output: ROW(foo_1.f1), foo_1.f1
+ Remote SQL: SELECT f1 FROM public.loct1
+ -> Async Foreign Scan on public.foo2 foo_3
+ Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+ Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
- Output: ROW(foo_1.f1), foo_1.f1
- Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
- Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
- Remote SQL: SELECT f1 FROM public.loct1
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -7303,17 +7318,18 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
+ Output: ROW(foo_1.f1), foo_1.f1
+ Remote SQL: SELECT f1 FROM public.loct1
+ -> Async Foreign Scan on public.foo2 foo_3
+ Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+ Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
- Output: ROW(foo_1.f1), foo_1.f1
- Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
- Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
- Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+(47 rows)
update bar set f2 = f2 + 100
from
@@ -7463,27 +7479,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2 bar_1
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2 bar_1
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2 bar_1
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2 bar_1
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
@@ -8558,11 +8580,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a % 25 =0 ORDER BY 1,2,3;
a | b | c
@@ -8597,20 +8620,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
wr | wr
@@ -8639,11 +8664,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
a | b
@@ -8696,21 +8722,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
-- test FOR UPDATE; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
- QUERY PLAN
---------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------
LockRows
-> Sort
Sort Key: t1.a
-> Hash Join
Hash Cond: (t2.b = t1.a)
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
-> Hash
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
a | b
@@ -8745,18 +8773,19 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+ Async subplans: 3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
-- Plan with partitionwise aggregates is enabled
SET enable_partitionwise_aggregate TO true;
@@ -8767,13 +8796,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ Async subplans: 3
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
a | sum | min | count
@@ -8795,29 +8825,22 @@ SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
Sort
Output: t1.a, (count(((t1.*)::pagg_tab)))
Sort Key: t1.a
- -> Append
- -> HashAggregate
- Output: t1.a, count(((t1.*)::pagg_tab))
- Group Key: t1.a
- Filter: (avg(t1.b) < '22'::numeric)
- -> Foreign Scan on public.fpagg_tab_p1 t1
- Output: t1.a, t1.*, t1.b
- Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
- -> HashAggregate
- Output: t1_1.a, count(((t1_1.*)::pagg_tab))
- Group Key: t1_1.a
- Filter: (avg(t1_1.b) < '22'::numeric)
- -> Foreign Scan on public.fpagg_tab_p2 t1_1
+ -> HashAggregate
+ Output: t1.a, count(((t1.*)::pagg_tab))
+ Group Key: t1.a
+ Filter: (avg(t1.b) < '22'::numeric)
+ -> Append
+ Async subplans: 3
+ -> Async Foreign Scan on public.fpagg_tab_p1 t1_1
Output: t1_1.a, t1_1.*, t1_1.b
- Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
- -> HashAggregate
- Output: t1_2.a, count(((t1_2.*)::pagg_tab))
- Group Key: t1_2.a
- Filter: (avg(t1_2.b) < '22'::numeric)
- -> Foreign Scan on public.fpagg_tab_p3 t1_2
+ Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
+ -> Async Foreign Scan on public.fpagg_tab_p2 t1_2
Output: t1_2.a, t1_2.*, t1_2.b
+ Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
+ -> Async Foreign Scan on public.fpagg_tab_p3 t1_3
+ Output: t1_3.a, t1_3.*, t1_3.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
-(25 rows)
+(18 rows)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
a | count
@@ -8837,20 +8860,15 @@ SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700
-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.b
- -> Finalize HashAggregate
+ -> HashAggregate
Group Key: pagg_tab.b
Filter: (sum(pagg_tab.a) < 700)
-> Append
- -> Partial HashAggregate
- Group Key: pagg_tab.b
- -> Foreign Scan on fpagg_tab_p1 pagg_tab
- -> Partial HashAggregate
- Group Key: pagg_tab_1.b
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_1
- -> Partial HashAggregate
- Group Key: pagg_tab_2.b
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_2
-(15 rows)
+ Async subplans: 3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
-- ===================================================================
-- access rights and superuser
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9fc53cad68..4bfc2d39ea 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -35,6 +37,7 @@
#include "optimizer/restrictinfo.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "postgres_fdw.h"
#include "utils/builtins.h"
#include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -122,11 +128,29 @@ enum FdwDirectModifyPrivateIndex
FdwDirectModifyPrivateSetProcessed
};
+/*
+ * Connection common state - shared among all PgFdwState instances using the
+ * same connection.
+ */
+typedef struct PgFdwConnCommonState
+{
+ ForeignScanState *leader; /* leader node of this connection */
+ bool busy; /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnCommonState *commonstate; /* connection common state */
+} PgFdwState;
+
/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +161,6 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool async; /* true if run asynchronously */
+ bool queued; /* true if this node is in waiter queue */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* last element in waiter queue.
+ * valid only on the leader node */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
/*
* Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
PG_RETURN_POINTER(routine);
}
@@ -1434,12 +1476,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+ fsstate->s.commonstate->leader = NULL;
+ fsstate->s.commonstate->busy = false;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->async = false;
+ fsstate->queued = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1487,40 +1539,241 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_values);
}
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue the node if it isn't already in the queue. Immediately send a
+ * request if the underlying connection is not busy.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+ /*
+ * Do nothing if the node is already in the queue or already eof'ed.
+ * Note: leader node is not marked as queued.
+ */
+ if (leader == node || fsstate->queued || fsstate->eof_reached)
+ return;
+
+ if (leader == NULL)
+ {
+ /* no leader means not busy, send request immediately */
+ request_more_data(node);
+ }
+ else
+ {
+ /* the connection is busy, queue the node */
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+ PgFdwScanState *last_waiter_state
+ = GetPgFdwScanState(leader_state->last_waiter);
+
+ last_waiter_state->waiter = node;
+ leader_state->last_waiter = node;
+ fsstate->queued = true;
+ }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Make the first waiter the next leader.
+ * Returns the new leader or NULL if there's no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *leader_state = GetPgFdwScanState(node);
+ ForeignScanState *next_leader = leader_state->waiter;
+
+ Assert(leader_state->s.commonstate->leader == node);
+
+ if (next_leader)
+ {
+ /* the first waiter becomes the next leader */
+ PgFdwScanState *next_leader_state = GetPgFdwScanState(next_leader);
+ next_leader_state->last_waiter = leader_state->last_waiter;
+ next_leader_state->queued = false;
+ }
+
+ leader_state->waiter = NULL;
+ leader_state->s.commonstate->leader = next_leader;
+
+ return next_leader;
+}
+
+/*
+ * Remove the node from waiter queue.
+ *
+ * Remaining results are cleared if the node is a busy leader.
+ * This is intended to be used during node shutdown.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state;
+ ForeignScanState *prev;
+ PgFdwScanState *prev_state;
+ ForeignScanState *cur;
+
+ /* no need to remove me */
+ if (!leader || !fsstate->queued)
+ return;
+
+ leader_state = GetPgFdwScanState(leader);
+
+ if (leader == node)
+ {
+ if (leader_state->s.commonstate->busy)
+ {
+ /*
+ * This node is waiting for a result; absorb it first so that
+ * the following commands can be sent on the connection.
+ */
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+ PGconn *conn = leader_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+
+ leader_state->s.commonstate->busy = false;
+ }
+
+ move_to_next_waiter(node);
+
+ return;
+ }
+
+ /*
+ * Just remove the node from the queue.
+ *
+ * Nodes don't have a link to the previous node, but this function is only
+ * called on the shutdown path, so we don't bother finding a faster way
+ * to do this.
+ */
+ prev = leader;
+ prev_state = leader_state;
+ cur = GetPgFdwScanState(prev)->waiter;
+ while (cur)
+ {
+ PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+ if (cur == node)
+ {
+ prev_state->waiter = curstate->waiter;
+
+ /* relink to the previous node if the last node was removed */
+ if (leader_state->last_waiter == cur)
+ leader_state->last_waiter = prev;
+
+ fsstate->queued = false;
+
+ return;
+ }
+ prev = cur;
+ prev_state = curstate;
+ cur = curstate->waiter;
+ }
+}
+
/*
* postgresIterateForeignScan
- * Retrieve next row from the result set, or clear tuple slot to indicate
- * EOF.
+ * Retrieve next row from the result set.
+ *
+ * For synchronous nodes, returning an empty tuple slot means EOF.
+ *
+ * For asynchronous nodes, when an empty tuple slot is returned, the caller
+ * needs to check the async state to tell whether all tuples have been
+ * received (AS_AVAILABLE) or more data is expected (AS_WAITING).
*/
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- /*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
- * Get some more tuples, if we've run out.
- */
+ if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+ {
+ /* we've run out, get some more tuples */
+ if (!node->fs_async)
+ {
+ /*
+ * finish the running query before sending the next command for
+ * this node
+ */
+ if (!fsstate->s.commonstate->busy)
+ vacate_connection((PgFdwState *)fsstate, false);
+
+ request_more_data(node);
+
+ /* Fetch the result immediately. */
+ fetch_received_data(node);
+ }
+ else if (!fsstate->s.commonstate->busy)
+ {
+ /* If the connection is not busy, just send the request. */
+ request_more_data(node);
+ }
+ else
+ {
+ /* The connection is busy, queue the request */
+ bool available = true;
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+ /* queue the requested node */
+ add_async_waiter(node);
+
+ /*
+ * The request for the next node cannot be sent before the leader
+ * responds. Finish the current leader if possible.
+ */
+ if (PQisBusy(leader_state->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT |
+ WL_EXIT_ON_PM_DEATH,
+ PQsocket(leader_state->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (!(rc & WL_SOCKET_READABLE))
+ available = false;
+ }
+
+ /* fetch the leader's data and enqueue it for the next request */
+ if (available)
+ {
+ fetch_received_data(leader);
+ add_async_waiter(leader);
+ }
+ }
+ }
+
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
- if (fsstate->next_tuple >= fsstate->num_tuples)
- return ExecClearTuple(slot);
+ /*
+ * We haven't received a result for the given node this time; return
+ * no tuple to give way to another node.
+ */
+ if (fsstate->eof_reached)
+ node->ss.ps.asyncstate = AS_AVAILABLE;
+ else
+ node->ss.ps.asyncstate = AS_WAITING;
+
+ return ExecClearTuple(slot);
}
/*
* Return the next tuple.
*/
+ node->ss.ps.asyncstate = AS_AVAILABLE;
ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
false);
@@ -1535,7 +1788,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1543,6 +1796,8 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ vacate_connection((PgFdwState *)fsstate, true);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1571,9 +1826,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1591,7 +1846,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1599,15 +1854,31 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
+/*
+ * postgresShutdownForeignScan
+ * Remove the node from the async machinery and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* remove the node from waiting queue */
+ remove_async_node(node);
+}
+
/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
@@ -2372,7 +2643,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2457,7 +2730,11 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ /* finish running query to send my command */
+ vacate_connection((PgFdwState *)dmstate, true);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2504,8 +2781,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2703,6 +2980,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnCommonState *commonstate;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2747,6 +3025,18 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ commonstate = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnCommonState));
+ if (commonstate)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.commonstate = commonstate;
+
+ /* finish running query to send my command */
+ vacate_connection(&tmpstate, true);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3317,11 +3607,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -3384,50 +3674,119 @@ create_cursor(ForeignScanState *node)
}
/*
- * Fetch some more rows from the node's cursor.
+ * Send the next request for the node. If the given node is not the current
+ * connection leader, push the current leader back onto the waiter queue and
+ * make the given node the leader.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* must be non-busy */
+ Assert(!fsstate->s.commonstate->busy);
+ /* must be not-eof'ed */
+ Assert(!fsstate->eof_reached);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.commonstate->busy = true;
+
+ /* The node is the current leader, just return. */
+ if (leader == node)
+ return;
+
+ /* Let the node be the leader */
+ if (leader != NULL)
+ {
+ remove_async_node(node);
+ fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+ fsstate->waiter = leader;
+ }
+ else
+ {
+ fsstate->last_waiter = node;
+ fsstate->waiter = NULL;
+ }
+
+ fsstate->s.commonstate->leader = node;
+}
+
+/*
+ * Fetch the received data, then automatically send the next waiter's request.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ ForeignScanState *waiter;
+
+ /* I should be the current connection leader */
+ Assert(fsstate->s.commonstate->leader == node);
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* Some tuples remain; move them to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
- char sql[64];
- int numrows;
+ PGconn *conn = fsstate->s.conn;
+ int addrows;
+ size_t newsize;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
-
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
+ res = pgfdw_get_result(conn, fsstate->query);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3437,22 +3796,73 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
}
PG_FINALLY();
{
+ fsstate->s.commonstate->busy = false;
+
if (res)
PQclear(res);
}
PG_END_TRY();
+ /* let the first waiter be the next leader of this connection */
+ waiter = move_to_next_waiter(node);
+
+ /* send the next request if any */
+ if (waiter)
+ request_more_data(waiter);
+
MemoryContextSwitchTo(oldcontext);
}
+/*
+ * Vacate the underlying connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+ PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+ ForeignScanState *leader;
+
+ Assert(commonstate != NULL);
+
+ /* just return if the connection is already available */
+ if (commonstate->leader == NULL || !commonstate->busy)
+ return;
+
+ /*
+ * Let the current connection leader read all of the results for the
+ * running query.
+ */
+ leader = commonstate->leader;
+ fetch_received_data(leader);
+
+ /* let the first waiter be the next leader of this connection */
+ move_to_next_waiter(leader);
+
+ if (!clear_queue)
+ return;
+
+ /* Clear the waiting list */
+ while (leader)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+ fsstate->last_waiter = NULL;
+ leader = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
@@ -3566,7 +3976,9 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -3653,6 +4065,9 @@ execute_foreign_modify(EState *estate,
operation == CMD_UPDATE ||
operation == CMD_DELETE);
+ /* finish running query to send my command */
+ vacate_connection((PgFdwState *)fmstate, true);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -3680,14 +4095,14 @@ execute_foreign_modify(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3695,10 +4110,10 @@ execute_foreign_modify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -3734,7 +4149,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3744,12 +4159,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3757,9 +4172,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3888,16 +4303,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -4056,9 +4471,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -4066,10 +4481,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -5560,6 +5975,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+
+/*
+ * Configure a wait event.
+ *
+ * Add the wait event that this ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* Reinit is not supported for now. */
+ Assert(reinit);
+
+ if (fsstate->s.commonstate->leader == node)
+ {
+ AddWaitEventToSet(wes,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, caller_data);
+ return true;
+ }
+
+ return false;
+}
+
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation, for use while EXPLAINing ForeignScan. It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..359208a12a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1780,25 +1780,25 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1840,12 +1840,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1904,8 +1904,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
-- Test that UPDATE/DELETE with inherited target works with row-level triggers
CREATE TRIGGER trig_row_before
--
2.18.4
On 20.08.2020 10:36, Kyotaro Horiguchi wrote:
At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in
On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote:
As the result of a discussion with Fujita-san off-list, I'm going to
hold off development until he decides whether mine or Thomas' is
better.
The latest patch doesn't apply so I set as WoA.
https://commitfest.postgresql.org/29/2491/
Thanks. This is rebased version.
At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
Either way, we definitely need patch 0001. One comment:
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
I wonder if it's better to have it receive ResourceOwner like that, or
to have it capture CurrentResourceOwner. I think the latter is more
common in existing code.
There's no existing WaitEventSets belonging to a resowner. So
unconditionally capturing CurrentResourceOwner doesn't work well. I
could pass a bool instead but that make things more complex.
Come to think of "complex", ExecAsync stuff in this patch might be
too-much for a short-term solution until executor overhaul, if it
comes shortly. (the patch of mine here as a whole is like that,
though..). The queueing stuff in postgres_fdw is, too.
regards.
Hi,
Looks like the current implementation of asynchronous append incorrectly
handles the LIMIT clause:
psql:append.sql:10: ERROR: another command is already in progress
CONTEXT: remote SQL command: CLOSE c1
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 22.09.2020 15:52, Konstantin Knizhnik wrote:
On 20.08.2020 10:36, Kyotaro Horiguchi wrote:
At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby
<pryzby@telsasoft.com> wrote in
On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote:
As the result of a discussion with Fujita-san off-list, I'm going to
hold off development until he decides whether mine or Thomas' is
better.
The latest patch doesn't apply so I set as WoA.
https://commitfest.postgresql.org/29/2491/
Thanks. This is rebased version.
At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro
<thomas.munro@gmail.com> wrote inEither way, we definitely need patch 0001.� One comment:
-CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)I wonder if it's better to have it receive ResourceOwner like that, or
to have it capture CurrentResourceOwner.� I think the latter is more
common in existing code.There's no existing WaitEventSets belonging to a resowner. So
unconditionally capturing CurrentResourceOwner doesn't work well. I
could pass a bool instead but that make things more complex.Come to think of "complex", ExecAsync stuff in this patch might be
too-much for a short-term solution until executor overhaul, if it
comes shortly. (the patch of mine here as a whole is like that,
though..). The queueing stuff in postgres_fdw is, too.regards.
Hi,
Looks like current implementation of asynchronous append incorrectly
handle LIMIT clause:
psql:append.sql:10: ERROR: another command is already in progress
CONTEXT: remote SQL command: CLOSE c1
Just FYI: the following patch fixes the problem:
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1667,6 +1667,11 @@ remove_async_node(ForeignScanState *node)
 		if (cur == node)
 		{
+			PGconn *conn = curstate->s.conn;
+
+			while(PQisBusy(conn))
+				PQclear(PQgetResult(conn));
+
 			prev_state->waiter = curstate->waiter;
 			/* relink to the previous node if the last node was removed */
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
async_append_limit.patch (text/x-patch) Download
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 1482436..9fe16cf 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1667,6 +1667,11 @@ remove_async_node(ForeignScanState *node)
if (cur == node)
{
+ PGconn *conn = curstate->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+
prev_state->waiter = curstate->waiter;
/* relink to the previous node if the last node was removed */
On 22.09.2020 16:40, Konstantin Knizhnik wrote:
On 22.09.2020 15:52, Konstantin Knizhnik wrote:
On 20.08.2020 10:36, Kyotaro Horiguchi wrote:
At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby
<pryzby@telsasoft.com> wrote in
On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote:
As the result of a discussion with Fujita-san off-list, I'm going to
hold off development until he decides whether mine or Thomas' is
better.
The latest patch doesn't apply so I set as WoA.
https://commitfest.postgresql.org/29/2491/
Thanks. This is rebased version.
At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro
<thomas.munro@gmail.com> wrote inEither way, we definitely need patch 0001.� One comment:
-CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)I wonder if it's better to have it receive ResourceOwner like that, or
to have it capture CurrentResourceOwner.� I think the latter is more
common in existing code.There's no existing WaitEventSets belonging to a resowner. So
unconditionally capturing CurrentResourceOwner doesn't work well. I
could pass a bool instead but that make things more complex.Come to think of "complex", ExecAsync stuff in this patch might be
too-much for a short-term solution until executor overhaul, if it
comes shortly. (the patch of mine here as a whole is like that,
though..). The queueing stuff in postgres_fdw is, too.regards.
Hi,
Looks like current implementation of asynchronous append incorrectly
handle LIMIT clause:
psql:append.sql:10: ERROR: another command is already in progress
CONTEXT: remote SQL command: CLOSE c1
Just FYI: the following patch fixes the problem:
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1667,6 +1667,11 @@ remove_async_node(ForeignScanState *node)
 		if (cur == node)
 		{
+			PGconn *conn = curstate->s.conn;
+
+			while(PQisBusy(conn))
+				PQclear(PQgetResult(conn));
+
 			prev_state->waiter = curstate->waiter;
 			/* relink to the previous node if the last node was removed */
Sorry, but it is not the only problem.
If you execute the query above and then in the same backend try to
insert more records, then the backend crashes:
Program terminated with signal SIGSEGV, Segmentation fault.
#0� 0x00007f5dfc59a231 in fetch_received_data (node=0x230c130) at
postgres_fdw.c:3736
3736		Assert(fsstate->s.commonstate->leader == node);
(gdb) p sstate->s.commonstate
No symbol "sstate" in current context.
(gdb) p fsstate->s.commonstate
Cannot access memory at address 0x7f7f7f7f7f7f7f87
Also my patch doesn't solve the problem for a small number of records
(100) in the table.
I attach yet another patch which fixes both problems.
Please notice that I did not go deep into the code of async append, so I
am not sure that my patch is complete and correct.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
async_append_limit-2.patch (text/x-patch) Download
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 1482436..ff15642 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1623,7 +1623,19 @@ remove_async_node(ForeignScanState *node)
PgFdwScanState *prev_state;
ForeignScanState *cur;
- /* no need to remove me */
+ if (fsstate->s.commonstate->busy)
+ {
+ /*
+ * this node is waiting for result, absorb the result first so
+ * that the following commands can be sent on the connection.
+ */
+ PGconn *conn = fsstate->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ }
+
+ /* no need to remove me */
if (!leader || !fsstate->queued)
return;
@@ -1631,23 +1643,7 @@ remove_async_node(ForeignScanState *node)
if (leader == node)
{
- if (leader_state->s.commonstate->busy)
- {
- /*
- * this node is waiting for result, absorb the result first so
- * that the following commands can be sent on the connection.
- */
- PgFdwScanState *leader_state = GetPgFdwScanState(leader);
- PGconn *conn = leader_state->s.conn;
-
- while(PQisBusy(conn))
- PQclear(PQgetResult(conn));
-
- leader_state->s.commonstate->busy = false;
- }
-
move_to_next_waiter(node);
-
return;
}
@@ -1858,7 +1854,7 @@ postgresEndForeignScan(ForeignScanState *node)
/* Release remote connection */
ReleaseConnection(fsstate->s.conn);
fsstate->s.conn = NULL;
-
+ fsstate->s.commonstate->leader = NULL;
/* MemoryContexts will be deleted automatically. */
}
On Tue, Sep 22, 2020 at 9:52 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
On 20.08.2020 10:36, Kyotaro Horiguchi wrote:
Come to think of "complex", ExecAsync stuff in this patch might be
too-much for a short-term solution until executor overhaul, if it
comes shortly. (the patch of mine here as a whole is like that,
though..). The queueing stuff in postgres_fdw is, too.
Looks like current implementation of asynchronous append incorrectly
handle LIMIT clause:
psql:append.sql:10: ERROR: another command is already in progress
CONTEXT: remote SQL command: CLOSE c1
Thanks for the report (and patch)!
The same issue has already been noticed in [1]. I too think the cause
of the issue would be in the 0003 patch (ie, “the queueing stuff” in
postgres_fdw), but I’m not sure it is really a good idea to have that
in postgres_fdw in the first place, because it would impact
performance negatively in some cases (see [1]).
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx+3dyivCT=A@mail.gmail.com
Your AsyncAppend doesn't switch to another source if the data in the
current leader is available:
/*
* The request for the next node cannot be sent before the leader
* responds. Finish the current leader if possible.
*/
if (PQisBusy(leader_state->s.conn))
{
int rc = WaitLatchOrSocket(NULL, WL_SOCKET_READABLE | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH, PQsocket(leader_state->s.conn), 0,
WAIT_EVENT_ASYNC_WAIT);
if (!(rc & WL_SOCKET_READABLE))
available = false;
}
/* fetch the leader's data and enqueue it for the next request */
if (available)
{
fetch_received_data(leader);
add_async_waiter(leader);
}
I don't understand why it is needed. If we have fdw connections with
different latencies, then we will read data from the fast connection
first. I think this may be a source of skew and decrease the efficiency
of asynchronous append.
For example, see below synthetic query:
CREATE TABLE l (a integer) PARTITION BY LIST (a);
CREATE FOREIGN TABLE f1 PARTITION OF l FOR VALUES IN (1) SERVER lb
OPTIONS (table_name 'l1');
CREATE FOREIGN TABLE f2 PARTITION OF l FOR VALUES IN (2) SERVER lb
OPTIONS (table_name 'l2');
INSERT INTO l (a) SELECT 2 FROM generate_series(1,200) as gs;
INSERT INTO l (a) SELECT 1 FROM generate_series(1,1000) as gs;
EXPLAIN ANALYZE (SELECT * FROM f1) UNION ALL (SELECT * FROM f2) LIMIT 400;
Result:
Limit (cost=100.00..122.21 rows=400 width=4) (actual time=0.483..1.183
rows=400 loops=1)
-> Append (cost=100.00..424.75 rows=5850 width=4) (actual
time=0.482..1.149 rows=400 loops=1)
-> Foreign Scan on f1 (cost=100.00..197.75 rows=2925
width=4) (actual time=0.481..1.115 rows=400 loops=1)
-> Foreign Scan on f2 (cost=100.00..197.75 rows=2925
width=4) (never executed)
As you can see, the executor scans one input and doesn't try to scan another.
--
regards,
Andrey Lepikhov
Postgres Professional
On Thu, Aug 20, 2020 at 4:36 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
This is rebased version.
Thanks for the rebased version!
Come to think of "complex", ExecAsync stuff in this patch might be
too-much for a short-term solution until executor overhaul, if it
comes shortly. (the patch of mine here as a whole is like that,
though..). The queueing stuff in postgres_fdw is, too.
Here are some review comments on “ExecAsync stuff” (the 0002 patch):
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ int sub_eflags = eflags;
+
+ /* Let async-capable subplans run asynchronously */
+ if (i < node->nasyncplans)
+ {
+ sub_eflags |= EXEC_FLAG_ASYNC;
+ nasyncplans++;
+ }
This would be more ambitious than Thomas’ patch: his patch only allows
foreign scan nodes beneath an Append node to be executed
asynchronously, but your patch allows any plan nodes beneath it (e.g.,
local child joins between foreign tables). Right? I think that would
be great, but I’m not sure how we execute such plan nodes
asynchronously as other parts of your patch seem to assume that only
foreign scan nodes beneath an Append are considered as async-capable.
Maybe I’m missing something, though. Could you elaborate on that?
Your patch (and the original patch by Robert [1]) modified
ExecAppend() so that it can process local plan nodes while waiting for
the results from remote queries, which would be also a feature that’s
not supported by Thomas’ patch, but I’d like to know performance
results. Did you do performance testing on that? I couldn’t find
that from the archive.
+bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
Do we really need to introduce the FDW API
IsForeignPathAsyncCapable()? I think we could determine whether a
foreign path is async-capable, by checking whether the FDW has the
postgresForeignAsyncConfigureWait() API.
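For illustration, the presence-based check suggested here could look roughly like the sketch below. This is a minimal, standalone approximation: `FdwRoutine` is a pared-down stand-in for the real struct in src/include/foreign/fdwapi.h, and the callback signature is simplified, so treat all names and signatures as assumptions rather than the actual API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pared-down stand-in for FdwRoutine (the real one lives in
 * src/include/foreign/fdwapi.h); only the callback relevant to the
 * suggestion is kept, with a simplified signature. */
typedef struct FdwRoutine
{
	bool		(*ForeignAsyncConfigureWait) (void *node);
} FdwRoutine;

/* Dummy callback standing in for postgresForeignAsyncConfigureWait(). */
static bool
dummy_configure_wait(void *node)
{
	(void) node;
	return true;
}

/*
 * The suggested rule: a foreign path counts as async-capable whenever its
 * FDW provides the ConfigureWait callback, so no separate
 * IsForeignPathAsyncCapable() API would be needed.
 */
static bool
fdw_is_async_capable(const FdwRoutine *routine)
{
	return routine != NULL && routine->ForeignAsyncConfigureWait != NULL;
}
```

The trade-off, as noted downthread, is that a separate predicate lets an FDW veto async execution per-path, which a simple presence check cannot express.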
In relation to the first comment, I noticed this change in the
postgres_fdw regression tests:
HEAD:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
QUERY PLAN
------------------------------------------------------------------------
Sort
Output: t1.a, (count(((t1.*)::pagg_tab)))
Sort Key: t1.a
-> Append
-> HashAggregate
Output: t1.a, count(((t1.*)::pagg_tab))
Group Key: t1.a
Filter: (avg(t1.b) < '22'::numeric)
-> Foreign Scan on public.fpagg_tab_p1 t1
Output: t1.a, t1.*, t1.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
-> HashAggregate
Output: t1_1.a, count(((t1_1.*)::pagg_tab))
Group Key: t1_1.a
Filter: (avg(t1_1.b) < '22'::numeric)
-> Foreign Scan on public.fpagg_tab_p2 t1_1
Output: t1_1.a, t1_1.*, t1_1.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
-> HashAggregate
Output: t1_2.a, count(((t1_2.*)::pagg_tab))
Group Key: t1_2.a
Filter: (avg(t1_2.b) < '22'::numeric)
-> Foreign Scan on public.fpagg_tab_p3 t1_2
Output: t1_2.a, t1_2.*, t1_2.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
(25 rows)
Patched:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
QUERY PLAN
------------------------------------------------------------------------
Sort
Output: t1.a, (count(((t1.*)::pagg_tab)))
Sort Key: t1.a
-> HashAggregate
Output: t1.a, count(((t1.*)::pagg_tab))
Group Key: t1.a
Filter: (avg(t1.b) < '22'::numeric)
-> Append
Async subplans: 3
-> Async Foreign Scan on public.fpagg_tab_p1 t1_1
Output: t1_1.a, t1_1.*, t1_1.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
-> Async Foreign Scan on public.fpagg_tab_p2 t1_2
Output: t1_2.a, t1_2.*, t1_2.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
-> Async Foreign Scan on public.fpagg_tab_p3 t1_3
Output: t1_3.a, t1_3.*, t1_3.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
(18 rows)
So, your patch can only handle foreign scan nodes beneath an Append
for now? Anyway, I think this would lead to the improved efficiency,
considering performance results from Movead [2]. And I think planner
changes to make this happen would be a good thing in your patch.
That’s all I have for now. Sorry for the delay.
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com
[2]: /messages/by-id/2020011417113872105895@highgo.ca
Thanks for reviewing.
At Sat, 26 Sep 2020 19:45:39 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
Come to think of "complex", ExecAsync stuff in this patch might be
too-much for a short-term solution until executor overhaul, if it
comes shortly. (the patch of mine here as a whole is like that,
though..). The queueing stuff in postgres_fdw is, too.
Here are some review comments on “ExecAsync stuff” (the 0002 patch):
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ int sub_eflags = eflags;
+
+ /* Let async-capable subplans run asynchronously */
+ if (i < node->nasyncplans)
+ {
+ sub_eflags |= EXEC_FLAG_ASYNC;
+ nasyncplans++;
+ }
This would be more ambitious than Thomas’ patch: his patch only allows
foreign scan nodes beneath an Append node to be executed
asynchronously, but your patch allows any plan nodes beneath it (e.g.,
local child joins between foreign tables). Right? I think that would
Right. It is intended to work in any place, but all upper nodes up to the
common node must be "async-aware and capable" for the machinery to work. So it
doesn't work currently since Append is the only async-aware node.
be great, but I’m not sure how we execute such plan nodes
asynchronously as other parts of your patch seem to assume that only
foreign scan nodes beneath an Append are considered as async-capable.
Maybe I’m missing something, though. Could you elaborate on that?
Right about this patch. As a trial at hand, in my faint memory, some
join methods and some aggregation can be async-aware but they are not
included in this patch so as not to bloat it with more complex stuff.
Your patch (and the original patch by Robert [1]) modified
ExecAppend() so that it can process local plan nodes while waiting for
the results from remote queries, which would be also a feature that’s
not supported by Thomas’ patch, but I’d like to know performance
results. Did you do performance testing on that? I couldn’t find
that from the archive.
At least, even though theoretically, I think it's obvious that it's
more performant to do something than just sitting waiting for the next
tuple to come from abroad. (It's not so obvious for a slow-local
vs. hyperspeed-remotes configuration, but...)
+bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+		{
+			FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+			Assert(fdwroutine != NULL);
+			if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+				fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+				return true;
+		}
Do we really need to introduce the FDW API
IsForeignPathAsyncCapable()? I think we could determine whether a
foreign path is async-capable, by checking whether the FDW has the
postgresForeignAsyncConfigureWait() API.
Note that the API routine takes a path, but it's just that a child
path in a certain form theoretically can obstruct async behavior.
In relation to the first comment, I noticed this change in the
postgres_fdw regression tests:
HEAD:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
QUERY PLAN
------------------------------------------------------------------------
Sort
Output: t1.a, (count(((t1.*)::pagg_tab)))
Sort Key: t1.a
-> Append
-> HashAggregate
Output: t1.a, count(((t1.*)::pagg_tab))
Group Key: t1.a
Filter: (avg(t1.b) < '22'::numeric)
-> Foreign Scan on public.fpagg_tab_p1 t1
Output: t1.a, t1.*, t1.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
-> HashAggregate
Output: t1_1.a, count(((t1_1.*)::pagg_tab))
Group Key: t1_1.a
Filter: (avg(t1_1.b) < '22'::numeric)
-> Foreign Scan on public.fpagg_tab_p2 t1_1
Output: t1_1.a, t1_1.*, t1_1.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
-> HashAggregate
Output: t1_2.a, count(((t1_2.*)::pagg_tab))
Group Key: t1_2.a
Filter: (avg(t1_2.b) < '22'::numeric)
-> Foreign Scan on public.fpagg_tab_p3 t1_2
Output: t1_2.a, t1_2.*, t1_2.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
(25 rows)
Patched:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
QUERY PLAN
------------------------------------------------------------------------
Sort
Output: t1.a, (count(((t1.*)::pagg_tab)))
Sort Key: t1.a
-> HashAggregate
Output: t1.a, count(((t1.*)::pagg_tab))
Group Key: t1.a
Filter: (avg(t1.b) < '22'::numeric)
-> Append
Async subplans: 3
-> Async Foreign Scan on public.fpagg_tab_p1 t1_1
Output: t1_1.a, t1_1.*, t1_1.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
-> Async Foreign Scan on public.fpagg_tab_p2 t1_2
Output: t1_2.a, t1_2.*, t1_2.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
-> Async Foreign Scan on public.fpagg_tab_p3 t1_3
Output: t1_3.a, t1_3.*, t1_3.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
(18 rows)
So, your patch can only handle foreign scan nodes beneath an Append
Yes, as I wrote above. Append-Foreign is the most promising and
suitable as an example. (and... Agg/WindowAgg are the hardest nodes
to make async-aware.)
for now? Anyway, I think this would lead to the improved efficiency,
considering performance results from Movead [2]. And I think planner
changes to make this happen would be a good thing in your patch.
Thanks.
That’s all I have for now. Sorry for the delay.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Sep 28, 2020 at 10:35 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Sat, 26 Sep 2020 19:45:39 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
Here are some review comments on “ExecAsync stuff” (the 0002 patch):
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ int sub_eflags = eflags;
+
+ /* Let async-capable subplans run asynchronously */
+ if (i < node->nasyncplans)
+ {
+ sub_eflags |= EXEC_FLAG_ASYNC;
+ nasyncplans++;
+ }
This would be more ambitious than Thomas’ patch: his patch only allows
foreign scan nodes beneath an Append node to be executed
asynchronously, but your patch allows any plan nodes beneath it (e.g.,
local child joins between foreign tables). Right? I think that would
Right. It is intended to work any place,
be great, but I’m not sure how we execute such plan nodes
asynchronously as other parts of your patch seem to assume that only
foreign scan nodes beneath an Append are considered as async-capable.
Maybe I’m missing something, though. Could you elaborate on that?
Right about this patch. As a trial at hand, in my faint memory, some
join methods and some aggregation can be async-aware but they are not
included in this patch not to bloat it with more complex stuff.
Yeah. I’m concerned about what was discussed in [1] as well. I think
it would be better only to allow foreign scan nodes beneath an Append,
as in Thomas’ patch (and the original patch by Robert), at least in
the first cut of this feature.
BTW: I noticed that you changed the ExecProcNode() API so that an
Append calling FDWs can know whether they return tuples immediately or
not:
+ while ((i = bms_first_member(needrequest)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
+ {
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ }
+ }
+ else
+ node->as_pending_async = bms_add_member(node->as_pending_async, i);
+ }
In the case of postgres_fdw:
/*
* postgresIterateForeignScan
- * Retrieve next row from the result set, or clear tuple slot to indicate
- * EOF.
+ * Retrieve next row from the result set.
+ *
+ * For synchronous nodes, returns clear tuple slot means EOF.
+ *
+ * For asynchronous nodes, if clear tuple slot is returned, the caller
+ * needs to check async state to tell if all tuples received
+ * (AS_AVAILABLE) or waiting for the next data to come (AS_WAITING).
*/
That is, 1) in postgresIterateForeignScan() postgres_fdw sets the new
PlanState’s flag asyncstate to AS_AVAILABLE/AS_WAITING depending on
whether it returns a tuple immediately or not, and then 2) the Append
knows that from the new flag when the callback routine returns. I’m
not sure this is a good idea, because it seems likely that the
ExecProcNode() change would affect many other places in the executor,
making maintenance and/or future development difficult. I think the
FDW callback routines proposed in the original patch by Robert would
provide a cleaner way to do asynchronous execution of FDWs without
changing the ExecProcNode() API, IIUC:
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
What is the reason for not doing it like this in your patch?
Thanks for the explanation!
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CA+TgmoYrbgTBnLwnr1v=pk+C=znWg7AgV9=M9ehrq6TDexPQNw@mail.gmail.com
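The three-callback protocol quoted above (ExecAsyncRequest, ExecAsyncConfigureWait, ExecAsyncNotify) can be sketched as a toy state machine. All names below (AsyncNode, async_request, and so on) are invented for illustration only; this is not the executor API from Robert's patch, just the control flow it describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy simulation of the three-callback async protocol. */
typedef enum { AS_IDLE, AS_WAITING, AS_DONE } AsyncState;

typedef struct AsyncNode
{
	AsyncState	state;
	bool		result_ready;	/* does the "remote" already have a tuple? */
	int			result;			/* the produced value once AS_DONE */
} AsyncNode;

/* 1. ExecAsyncRequest: either finish immediately (the
 * ExecAsyncRequestDone case) or ask to wait on a file descriptor
 * (the ExecAsyncSetRequiredEvents case). */
static void
async_request(AsyncNode *node)
{
	if (node->result_ready)
	{
		node->result = 42;
		node->state = AS_DONE;
	}
	else
		node->state = AS_WAITING;
}

/* 2. ExecAsyncConfigureWait: report whether the node registered a
 * wait event with the event loop. */
static bool
async_configure_wait(const AsyncNode *node)
{
	return node->state == AS_WAITING;
}

/* 3. ExecAsyncNotify: the fd became ready, so produce the result. */
static void
async_notify(AsyncNode *node)
{
	node->result = 42;
	node->state = AS_DONE;
}
```

The point of the design is visible even in this sketch: the caller drives the node only through these three entry points, so ExecProcNode() itself never needs an asyncstate flag.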
On Mon, Sep 28, 2020 at 10:35 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Sat, 26 Sep 2020 19:45:39 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
Your patch (and the original patch by Robert [1]) modified
ExecAppend() so that it can process local plan nodes while waiting for
the results from remote queries, which would be also a feature that’s
not supported by Thomas’ patch, but I’d like to know performance
results.
At least, even though theoretically, I think it's obvious that it's
performant to do something than just sitting waiting for the next tuple
to come from abroad.
I did a simple test on my laptop:
create table t1 (a int, b int, c text);
create foreign table p1 (a int, b int, c text) server server1 options
(table_name 't1');
create table p2 (a int, b int, c text);
insert into p1 select 10 + i % 10, i, to_char(i, 'FM00000') from
generate_series(0, 99999) i;
insert into p2 select 20 + i % 10, i, to_char(i, 'FM00000') from
generate_series(0, 99999) i;
analyze p1;
vacuum analyze p2;
create table pt (a int, b int, c text) partition by range (a);
alter table pt attach partition p1 for values from (10) to (20);
alter table pt attach partition p2 for values from (20) to (30);
set enable_partitionwise_aggregate to on;
select a, count(*) from pt group by a;
HEAD: 47.734 ms
With your patch: 32.400 ms
This test is pretty simple, but I think this shows that the mentioned
feature would be useful for cases where it takes time to get the
results from remote queries.
Cool!
Best regards,
Etsuro Fujita
At Wed, 30 Sep 2020 16:30:41 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
Thanks. Since it starts all remote nodes before local ones, the
startup gain would be the shorter of the startup time of the fastest
remote node and the time required for all local nodes. On top of that,
remote transfers benefit from asynchronous fetching.
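Spelling out that arithmetic as a hypothetical model (this is an editor's illustration, not code or figures from the patch):

```c
#include <assert.h>

/*
 * Hypothetical model of the startup gain described above: remote
 * subplans are all kicked off before local work begins, so the time
 * hidden is bounded by whichever is shorter -- the fastest remote
 * node's startup time or the total time spent on local subplans.
 */
double
startup_gain_ms(double fastest_remote_startup_ms, double total_local_ms)
{
	return (fastest_remote_startup_ms < total_local_ms)
		? fastest_remote_startup_ms
		: total_local_ms;
}
```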
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Oct 01, 2020 at 11:16:53AM +0900, Kyotaro Horiguchi wrote:
Thanks. Since it starts all remote nodes before local ones, the
startup gain would be the shorter of the startup time of the fastest
remote node and the time required for all local nodes. On top of that,
remote transfers benefit from asynchronous fetching.
The patch fails to apply per the CF bot. For now, I have moved it to
next CF, waiting on author.
--
Michael
At Thu, 1 Oct 2020 12:56:02 +0900, Michael Paquier <michael@paquier.xyz> wrote in
Thanks! Rebased.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v7-0001-Allow-wait-event-set-to-be-registered-to-resource.patchtext/x-patch; charset=us-asciiDownload
From 09a38c30aed31673d3f9360a1853f5f99948f016 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v7 1/3] Allow wait event set to be registered to resource
owner
WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resowner field to WaitEventSet and allows the
creator of a WaitEventSet to specify a resource owner.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/postmaster/pgstat.c | 2 +-
src/backend/postmaster/syslogger.c | 2 +-
src/backend/storage/ipc/latch.c | 20 ++++++--
src/backend/utils/resowner/resowner.c | 67 +++++++++++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 ++++
7 files changed, 98 insertions(+), 7 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index ac986c0505..799fa5006d 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836..30020f8cda 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4503,7 +4503,7 @@ PgstatCollectorMain(int argc, char *argv[])
pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
/* Prepare to wait for our latch or data in our socket. */
- wes = CreateWaitEventSet(CurrentMemoryContext, 3);
+ wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index ffcb54968f..a4de6d90e2 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -300,7 +300,7 @@ SysLoggerMain(int argc, char *argv[])
* syslog pipe, which implies that all other backends have exited
* (including the postmaster).
*/
- wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+ wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
#ifndef WIN32
AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 63c6c97536..108a6127e9 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -57,6 +57,7 @@
#include "storage/pmsignal.h"
#include "storage/shmem.h"
#include "utils/memutils.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -85,6 +86,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -257,7 +260,7 @@ InitializeLatchWaitSet(void)
Assert(LatchWaitSet == NULL);
/* Set up the WaitEventSet used by WaitLatch(). */
- LatchWaitSet = CreateWaitEventSet(TopMemoryContext, 2);
+ LatchWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 2);
latch_pos = AddWaitEventToSet(LatchWaitSet, WL_LATCH_SET, PGINVALID_SOCKET,
MyLatch, NULL);
if (IsUnderPostmaster)
@@ -441,7 +444,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -608,12 +611,15 @@ ResetLatch(Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -728,6 +734,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -773,6 +784,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..237ca9fa30 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
ResourceArray jitarr; /* JIT contexts */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
jit_release_context(context);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -725,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
Assert(owner->jitarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -752,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
ResourceArrayFree(&(owner->jitarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1370,3 +1385,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
elog(ERROR, "JIT context %p is not owned by resource owner %s",
DatumGetPointer(handle), owner->name);
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 7c742021fb..ae13d4c08d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
extern void SetLatch(Latch *latch);
extern void ResetLatch(Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
extern void ResourceOwnerForgetJIT(ResourceOwner owner,
Datum handle);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.18.4
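Before moving on to the next patch: the resowner discipline the patch above follows (ResourceOwnerEnlargeWESs before the wait event set exists, Remember once it does, Forget on release) can be modeled by this standalone sketch. All names here are made up; it mirrors the pattern, not PostgreSQL's actual ResourceArray code:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Standalone model of the resowner pattern used by the patch above:
 * grow the tracking array *before* acquiring the resource, so that
 * remembering it afterwards cannot fail for lack of memory.
 */
typedef struct
{
	void  **items;
	int		nitems;
	int		maxitems;
} ResArray;

/* Ensure room for one more entry; may fail, but before acquisition. */
int
res_enlarge(ResArray *arr)
{
	if (arr->nitems < arr->maxitems)
		return 1;
	int		newmax = (arr->maxitems > 0) ? arr->maxitems * 2 : 8;
	void  **tmp = realloc(arr->items, newmax * sizeof(void *));

	if (tmp == NULL)
		return 0;
	arr->items = tmp;
	arr->maxitems = newmax;
	return 1;
}

/* Cannot fail: room was reserved by res_enlarge(). */
void
res_remember(ResArray *arr, void *res)
{
	arr->items[arr->nitems++] = res;
}

/* Returns 1 if the resource was tracked, 0 otherwise. */
int
res_forget(ResArray *arr, void *res)
{
	for (int i = 0; i < arr->nitems; i++)
	{
		if (arr->items[i] == res)
		{
			/* order doesn't matter; move the last entry into the hole */
			arr->items[i] = arr->items[--arr->nitems];
			return 1;
		}
	}
	return 0;
}
```

Anything still tracked at transaction abort would be released by the owner, which is exactly what the patch arranges for leaked WaitEventSets.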
v7-0002-Infrastructure-for-asynchronous-execution.patchtext/x-patch; charset=us-asciiDownload
From c47ea326f3557d7ac03886c91fda5ebf689ae068 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v7 2/3] Infrastructure for asynchronous execution
This patch adds infrastructure for asynchronous execution. As a PoC,
this makes only Append capable of handling asynchronously executable
subnodes.
---
src/backend/commands/explain.c | 17 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execAsync.c | 152 +++++++++++
src/backend/executor/nodeAppend.c | 342 ++++++++++++++++++++----
src/backend/executor/nodeForeignscan.c | 21 ++
src/backend/nodes/bitmapset.c | 72 +++++
src/backend/nodes/copyfuncs.c | 3 +
src/backend/nodes/outfuncs.c | 3 +
src/backend/nodes/readfuncs.c | 3 +
src/backend/optimizer/path/allpaths.c | 24 ++
src/backend/optimizer/path/costsize.c | 55 +++-
src/backend/optimizer/plan/createplan.c | 45 +++-
src/backend/postmaster/pgstat.c | 3 +
src/backend/utils/adt/ruleutils.c | 8 +-
src/backend/utils/resowner/resowner.c | 4 +-
src/include/executor/execAsync.h | 22 ++
src/include/executor/executor.h | 1 +
src/include/executor/nodeForeignscan.h | 3 +
src/include/foreign/fdwapi.h | 11 +
src/include/nodes/bitmapset.h | 1 +
src/include/nodes/execnodes.h | 23 +-
src/include/nodes/plannodes.h | 9 +
src/include/optimizer/paths.h | 2 +
src/include/pgstat.h | 3 +-
24 files changed, 756 insertions(+), 72 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c98c9b5547..097355f6f9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -86,6 +86,7 @@ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
List *ancestors, ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
ExplainState *es);
static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1377,6 +1378,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1958,6 +1961,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_Hash:
show_hash_info(castNode(HashState, planstate), es);
break;
+
+ case T_Append:
+ show_append_info(castNode(AppendState, planstate), es);
+ break;
+
default:
break;
}
@@ -2311,6 +2319,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ancestors, es);
}
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+ Append *plan = (Append *) astate->ps.plan;
+
+ if (plan->nasyncplans > 0)
+ ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
/*
* Show the grouping keys for an Agg node.
*/
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller didn't reuse existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit)
+{
+ switch (nodeTag(node))
+ {
+ case T_ForeignScanState:
+ return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+ wes, data, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(node));
+ }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+ int **p_refind;
+ int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+ /* arg is the address of the variable refind in ExecAsyncEventWait */
+ ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+ *mcbarg->p_refind = NULL;
+ *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
+ WaitEventSet *wes;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred = 0;
+ Bitmapset *fired_events = NULL;
+ int i;
+ int n;
+
+ n = bms_num_members(waitnodes);
+ wes = CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner, n);
+ if (refindsize < n)
+ {
+ if (refindsize == 0)
+ refindsize = EVENT_BUFFER_SIZE; /* XXX */
+ while (refindsize < n)
+ refindsize *= 2;
+ if (refind)
+ refind = (int *) repalloc(refind, refindsize * sizeof(int));
+ else
+ {
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+ MemoryContext oldctxt =
+ MemoryContextSwitchTo(TopTransactionContext);
+
+ /*
+ * refind points to a memory block in
+ * TopTransactionContext. Register a callback to reset it.
+ */
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+ refind = (int *) palloc(refindsize * sizeof(int));
+ MemoryContextSwitchTo(oldctxt);
+ }
+ }
+
+ /* Prepare WaitEventSet for waiting on the waitnodes. */
+ n = 0;
+ for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+ i = bms_next_member(waitnodes, i))
+ {
+ refind[i] = i;
+ if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+ n++;
+ }
+
+ /* Return immediately if no node to wait. */
+ if (n == 0)
+ {
+ FreeWaitEventSet(wes);
+ return NULL;
+ }
+
+ noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+ EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+ FreeWaitEventSet(wes);
+ if (noccurred == 0)
+ return NULL;
+
+ for (i = 0 ; i < noccurred ; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+ {
+ int n = *(int*)w->user_data;
+
+ fired_events = bms_add_member(fired_events, n);
+ }
+ }
+
+ return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..60c36ee048 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
#include "miscadmin.h"
/* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
#define INVALID_SUBPLAN_INDEX -1
static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
PlanState **appendplanstates;
Bitmapset *validsubplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
/* check for unsupported flags */
- Assert(!(eflags & EXEC_FLAG_MARK));
+ Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
/*
* create new AppendState for our append node
*/
appendstate->ps.plan = (Plan *) node;
appendstate->ps.state = estate;
- appendstate->ps.ExecProcNode = ExecAppend;
/* Let choose_next_subplan_* function handle setting the first subplan */
- appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/*
* When no run-time pruning is required and there's at least one
- * subplan, we can fill as_valid_subplans immediately, preventing
+ * subplan, we can fill as_valid_syncsubplans immediately, preventing
* later calls to ExecFindMatchingSubPlans.
*/
if (!prunestate->do_exec_prune && nplans > 0)
- appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
}
else
{
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* subplans as valid; they must also all be initialized.
*/
Assert(nplans > 0);
- appendstate->as_valid_subplans = validsubplans =
- bms_add_range(NULL, 0, nplans - 1);
+ validsubplans = bms_add_range(NULL, 0, nplans - 1);
+ appendstate->as_valid_syncsubplans =
+ bms_add_range(NULL, node->nasyncplans, nplans - 1);
appendstate->as_prune_state = NULL;
}
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
*/
j = 0;
firstvalid = nplans;
+ nasyncplans = 0;
+
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ int sub_eflags = eflags;
+
+ /* Let async-capable subplans run asynchronously */
+ if (i < node->nasyncplans)
+ {
+ sub_eflags |= EXEC_FLAG_ASYNC;
+ nasyncplans++;
+ }
/*
* Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
if (i >= node->first_partial_plan && j < firstvalid)
firstvalid = j;
- appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+ appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
}
appendstate->as_first_partial_plan = firstvalid;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* fill in async stuff */
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_syncdone = (nasyncplans == nplans);
+ appendstate->as_exec_prune = false;
+
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+ if (appendstate->as_nasyncplans)
+ {
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async requests need a request */
+ appendstate->as_needrequest =
+ bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+ /*
+ * ExecAppendAsync needs as_valid_syncsubplans to handle async
+ * subnodes.
+ */
+ if (appendstate->as_prune_state != NULL &&
+ appendstate->as_prune_state->do_exec_prune)
+ {
+ Assert(appendstate->as_valid_syncsubplans == NULL);
+
+ appendstate->as_exec_prune = true;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
- if (node->as_whichplan < 0)
+ if (node->as_whichsyncplan < 0)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
* If no subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+ if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
!node->choose_next_subplan(node))
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
+ Assert(node->as_nasyncplans == 0);
+
for (;;)
{
PlanState *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
/*
* figure out which subplan we are currently processing
*/
- Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
- subnode = node->appendplans[node->as_whichplan];
+ Assert(node->as_whichsyncplan >= 0 &&
+ node->as_whichsyncplan < node->as_nplans);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
}
}
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+ AppendState *node = castNode(AppendState, pstate);
+ Bitmapset *needrequest;
+ int i;
+
+ Assert(node->as_nasyncplans > 0);
+
+restart:
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (node->as_exec_prune)
+ {
+ Bitmapset *valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ /* Distribute valid subplans into sync and async */
+ node->as_needrequest =
+ bms_intersect(node->as_needrequest, valid_subplans);
+ node->as_valid_syncsubplans =
+ bms_difference(valid_subplans, node->as_needrequest);
+
+ node->as_exec_prune = false;
+ }
+
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ while ((i = bms_first_member(needrequest)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
+ {
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ }
+ }
+ else
+ node->as_pending_async = bms_add_member(node->as_pending_async, i);
+ }
+ bms_free(needrequest);
+
+ for (;;)
+ {
+ TupleTableSlot *result;
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ while (!bms_is_empty(node->as_pending_async))
+ {
+ /* Don't wait for async nodes if any sync node exists. */
+ long timeout = node->as_syncdone ? -1 : 0;
+ Bitmapset *fired;
+ int i;
+
+ fired = ExecAsyncEventWait(node->appendplans,
+ node->as_pending_async,
+ timeout);
+
+ if (bms_is_empty(fired) && node->as_syncdone)
+ {
+ /*
+ * We come here when all the subnodes had fired before
+ * waiting. Retry fetching from the nodes.
+ */
+ node->as_needrequest = node->as_pending_async;
+ node->as_pending_async = NULL;
+ goto restart;
+ }
+
+ while ((i = bms_first_member(fired)) >= 0)
+ {
+ TupleTableSlot *slot;
+ PlanState *subnode = node->appendplans[i];
+ slot = ExecProcNode(subnode);
+
+ Assert(subnode->asyncstate == AS_AVAILABLE);
+
+ if (!TupIsNull(slot))
+ {
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, i);
+ }
+
+ node->as_pending_async =
+ bms_del_member(node->as_pending_async, i);
+ }
+ bms_free(fired);
+
+ /* return now if a result is available */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done scanning
+ * this node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the synchronous children.
+ */
+
+ if (!node->as_syncdone &&
+ node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+ node->as_syncdone = !node->choose_next_subplan(node);
+
+ if (node->as_syncdone)
+ {
+ Assert(bms_is_empty(node->as_pending_async));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
+ /*
+ * get a tuple from the subplan
+ */
+ result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+ if (!TupIsNull(result))
+ {
+ /*
+ * If the subplan gave us something then return it as-is. We do
+ * NOT make use of the result slot that was set up in
+ * ExecInitAppend; there's no need for it.
+ */
+ return result;
+ }
+
+ /*
+ * Go on to the "next" subplan. If no more subplans, return the empty
+ * slot set up for us by ExecInitAppend, unless there are async plans
+ * we have yet to finish.
+ */
+ if (!node->choose_next_subplan(node))
+ {
+ node->as_syncdone = true;
+ if (bms_is_empty(node->as_pending_async))
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /* Else loop back and try to get a tuple from the new subplan */
+ }
+}
+
/* ----------------------------------------------------------------
* ExecEndAppend
*
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
bms_overlap(node->ps.chgParam,
node->as_prune_state->execparamids))
{
- bms_free(node->as_valid_subplans);
- node->as_valid_subplans = NULL;
+ bms_free(node->as_valid_syncsubplans);
+ node->as_valid_syncsubplans = NULL;
}
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ ExecShutdownNode(node->appendplans[i]);
+
+ node->as_nasyncresult = 0;
+ node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
}
/* Let choose_next_subplan_* function handle setting the first subplan */
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
}
/* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
static bool
choose_next_subplan_locally(AppendState *node)
{
- int whichplan = node->as_whichplan;
+ int whichplan = node->as_whichsyncplan;
int nextplan;
/* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
{
- if (node->as_valid_subplans == NULL)
- node->as_valid_subplans =
+ /* Shouldn't have an active async node */
+ Assert(bms_is_empty(node->as_needrequest));
+
+ if (node->as_valid_syncsubplans == NULL)
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
+ /* Exclude async plans */
+ if (node->as_nasyncplans > 0)
+ bms_del_range(node->as_valid_syncsubplans,
+ 0, node->as_nasyncplans - 1);
+
whichplan = -1;
}
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
Assert(whichplan >= -1 && whichplan <= node->as_nplans);
if (ScanDirectionIsForward(node->ps.state->es_direction))
- nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
else
- nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+ nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
if (nextplan < 0)
return false;
- node->as_whichplan = nextplan;
+ node->as_whichsyncplan = nextplan;
return true;
}
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
{
/* Mark just-completed subplan as finished. */
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
}
else
{
/* Start with last subplan. */
- node->as_whichplan = node->as_nplans - 1;
+ node->as_whichsyncplan = node->as_nplans - 1;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be
* set to all subplans.
*/
- if (node->as_valid_subplans == NULL)
+ if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
/*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
}
/* Loop until we find a subplan to execute. */
- while (pstate->pa_finished[node->as_whichplan])
+ while (pstate->pa_finished[node->as_whichsyncplan])
{
- if (node->as_whichplan == 0)
+ if (node->as_whichsyncplan == 0)
{
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
- node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
LWLockRelease(&pstate->pa_lock);
return false;
}
/*
- * We needn't pay attention to as_valid_subplans here as all invalid
+ * We needn't pay attention to as_valid_syncsubplans here as all invalid
* plans have been marked as finished.
*/
- node->as_whichplan--;
+ node->as_whichsyncplan--;
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
/* Backward scan is not supported by parallel-aware plans */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
- /* We should never be called when there are no subplans */
- Assert(node->as_nplans > 0);
+ /* We should never be called when there are no sync subplans */
+ Assert(node->as_nplans > node->as_nasyncplans);
LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
/* Mark just-completed subplan as finished. */
- if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
/*
* If we've yet to determine the valid subplans then do so now. If
* run-time pruning is disabled then the valid subplans will always be set
* to all subplans.
*/
- else if (node->as_valid_subplans == NULL)
+ else if (node->as_valid_syncsubplans == NULL)
{
- node->as_valid_subplans =
+ node->as_valid_syncsubplans =
ExecFindMatchingSubPlans(node->as_prune_state);
mark_invalid_subplans_as_finished(node);
}
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Save the plan from which we are starting the search. */
- node->as_whichplan = pstate->pa_next_plan;
+ node->as_whichsyncplan = pstate->pa_next_plan;
/* Loop until we find a valid subplan to execute. */
while (pstate->pa_finished[pstate->pa_next_plan])
{
int nextplan;
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
if (nextplan >= 0)
{
/* Advance to the next valid plan. */
pstate->pa_next_plan = nextplan;
}
- else if (node->as_whichplan > node->as_first_partial_plan)
+ else if (node->as_whichsyncplan > node->as_first_partial_plan)
{
/*
* Try looping back to the first valid partial plan, if there is
* one. If there isn't, arrange to bail out below.
*/
- nextplan = bms_next_member(node->as_valid_subplans,
+ nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
pstate->pa_next_plan =
- nextplan < 0 ? node->as_whichplan : nextplan;
+ nextplan < 0 ? node->as_whichsyncplan : nextplan;
}
else
{
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
* At last plan, and either there are no partial plans or we've
* tried them all. Arrange to bail out.
*/
- pstate->pa_next_plan = node->as_whichplan;
+ pstate->pa_next_plan = node->as_whichsyncplan;
}
- if (pstate->pa_next_plan == node->as_whichplan)
+ if (pstate->pa_next_plan == node->as_whichsyncplan)
{
/* We've tried everything! */
pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* Pick the plan we found, and advance pa_next_plan one more time. */
- node->as_whichplan = pstate->pa_next_plan;
- pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+ node->as_whichsyncplan = pstate->pa_next_plan;
+ pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
pstate->pa_next_plan);
/*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
*/
if (pstate->pa_next_plan < 0)
{
- int nextplan = bms_next_member(node->as_valid_subplans,
+ int nextplan = bms_next_member(node->as_valid_syncsubplans,
node->as_first_partial_plan - 1);
if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
}
/* If non-partial, immediately mark as finished. */
- if (node->as_whichplan < node->as_first_partial_plan)
- node->as_pstate->pa_finished[node->as_whichplan] = true;
+ if (node->as_whichsyncplan < node->as_first_partial_plan)
+ node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
LWLockRelease(&pstate->pa_lock);
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
Assert(node->as_prune_state);
/* Nothing to do if all plans are valid */
- if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+ if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
return;
/* Mark all non-valid plans as finished */
for (i = 0; i < node->as_nplans; i++)
{
- if (!bms_is_member(i, node->as_valid_subplans))
+ if (!bms_is_member(i, node->as_valid_syncsubplans))
node->as_pstate->pa_finished[i] = true;
}
}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
scanstate->ss.ps.plan = (Plan *) node;
scanstate->ss.ps.state = estate;
scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+ scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+ if ((eflags & EXEC_FLAG_ASYNC) != 0)
+ scanstate->fs_async = true;
/*
* Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecForeignAsyncConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+ caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
return a;
}
+/*
+ * bms_del_range
+ * Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop; however,
+ * this function is faster when the range is large, since we work at the
+ * bitmapword level rather than at the bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+ int lwordnum,
+ lbitnum,
+ uwordnum,
+ ushiftbits,
+ wordnum;
+
+ if (lower < 0 || upper < 0)
+ elog(ERROR, "negative bitmapset member not allowed");
+ if (lower > upper)
+ elog(ERROR, "lower range must not be above upper range");
+ uwordnum = WORDNUM(upper);
+
+ if (a == NULL)
+ {
+ a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ }
+
+ /* ensure we have enough words to store the upper bit */
+ else if (uwordnum >= a->nwords)
+ {
+ int oldnwords = a->nwords;
+ int i;
+
+ a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+ a->nwords = uwordnum + 1;
+ /* zero out the enlarged portion */
+ for (i = oldnwords; i < a->nwords; i++)
+ a->words[i] = 0;
+ }
+
+ wordnum = lwordnum = WORDNUM(lower);
+
+ lbitnum = BITNUM(lower);
+ ushiftbits = BITNUM(upper) + 1;
+
+ /*
+ * In the special case where lwordnum is the same as uwordnum, we must
+ * perform both the upper and lower masking on the same word.
+ */
+ if (lwordnum == uwordnum)
+ {
+ a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+ | (~(bitmapword) 0) << ushiftbits);
+ }
+ else
+ {
+ /* turn off lbitnum and all bits left of it */
+ a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+ /* turn off all bits for any intermediate words */
+ while (wordnum < uwordnum)
+ a->words[wordnum++] = (bitmapword) 0;
+
+ /* turn off upper's bit and all bits right of it. */
+ a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+ }
+
+ return a;
+}
+
/*
* bms_int_members - like bms_intersect, but left input is recycled
*/
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 0409a40b82..4eff3712b7 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f0386480ab..2b1b0e9141 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..63af7c02d8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1572,6 +1572,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1672,6 +1673,8 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b399592ff8..17e9a7a897 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3973,6 +3973,30 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
list_free(live_children);
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
+
/*****************************************************************************
* DEBUG SUPPORT
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index cd3716d494..143e00b13e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2048,22 +2048,59 @@ cost_append(AppendPath *apath)
if (pathkeys == NIL)
{
- Path *subpath = (Path *) linitial(apath->subpaths);
-
- /*
- * For an unordered, non-parallel-aware Append we take the startup
- * cost as the startup cost of the first subpath.
- */
- apath->path.startup_cost = subpath->startup_cost;
+ Cost first_nonasync_startup_cost = -1.0;
+ Cost async_min_startup_cost = -1.0;
+ Cost async_max_cost = 0.0;
/* Compute rows and costs as sums of subplan rows and costs. */
foreach(l, apath->subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ /*
+ * For an unordered, non-parallel-aware Append we take the
+ * startup cost as the startup cost of the first
+ * nonasync-capable subpath or the minimum startup cost of
+ * async-capable subpaths.
+ */
+ if (!is_async_capable_path(subpath))
+ {
+ if (first_nonasync_startup_cost < 0.0)
+ first_nonasync_startup_cost = subpath->startup_cost;
+
+ apath->path.total_cost += subpath->total_cost;
+ }
+ else
+ {
+ if (async_min_startup_cost < 0.0 ||
+ async_min_startup_cost > subpath->startup_cost)
+ async_min_startup_cost = subpath->startup_cost;
+
+ /*
+ * It's not obvious how to determine the total cost of
+ * async subnodes. Although it is not always true, we
+ * assume it is the maximum cost among all async subnodes.
+ */
+ if (async_max_cost < subpath->total_cost)
+ async_max_cost = subpath->total_cost;
+ }
+
apath->path.rows += subpath->rows;
- apath->path.total_cost += subpath->total_cost;
}
+
+ /*
+ * If there are any sync subnodes, the startup cost is the startup
+ * cost of the first sync subnode. Otherwise it's the minimum
+ * startup cost of the async subnodes.
+ */
+ if (first_nonasync_startup_cost >= 0.0)
+ apath->path.startup_cost = first_nonasync_startup_cost;
+ else
+ apath->path.startup_cost = async_min_startup_cost;
+
+ /* Use async maximum cost if it exceeds the sync total cost */
+ if (async_max_cost > apath->path.total_cost)
+ apath->path.total_cost = async_max_cost;
}
else
{
@@ -2084,6 +2121,8 @@ cost_append(AppendPath *apath)
* This case is also different from the above in that we have to
* account for possibly injecting sorts into subpaths that aren't
* natively ordered.
+ *
+ * Note: An ordered append won't be run asynchronously.
*/
foreach(l, apath->subpaths)
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 3d7a4e373f..3ae46ed6f1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1082,6 +1082,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
bool tlist_was_changed = false;
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
+ List *asyncpaths = NIL;
+ List *syncpaths = NIL;
+ List *newsubpaths = NIL;
ListCell *subpaths;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
@@ -1090,6 +1095,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1219,9 +1227,40 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
}
- subplans = lappend(subplans, subplan);
+ /*
+ * Classify as async-capable or not. If we have decided to run the
+ * children in parallel, we cannot run any of them asynchronously.
+ * The planner assumes that all subnodes are executed in order if this
+ * Append is ordered, so no subpath can be run asynchronously in that
+ * case either.
+ */
+ if (pathkeys == NIL &&
+ !best_path->path.parallel_safe && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ asyncplans = lappend(asyncplans, subplan);
+ asyncpaths = lappend(asyncpaths, subpath);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ {
+ syncplans = lappend(syncplans, subplan);
+ syncpaths = lappend(syncpaths, subpath);
+ }
+
+ first = false;
}
+ /*
+ * subplans contains the async plans in the first half, if any, and the
+ * sync plans in the second half, if any. The subpaths must be ordered the
+ * same way to keep partition pruning information in sync with subplans.
+ */
+ subplans = list_concat(asyncplans, syncplans);
+ newsubpaths = list_concat(asyncpaths, syncpaths);
+
/*
* If any quals exist, they may be useful to perform further partition
* pruning during execution. Gather information needed by the executor to
@@ -1249,7 +1288,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
if (prunequal != NIL)
partpruneinfo =
make_partition_pruneinfo(root, rel,
- best_path->subpaths,
+ newsubpaths,
best_path->partitioned_rels,
prunequal);
}
@@ -1257,6 +1296,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
plan->appendplans = subplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
+ plan->nasyncplans = nasyncplans;
+ plan->referent = referent_is_sync ? nasyncplans : 0;
copy_generic_path_info(&plan->plan, (Path *) best_path);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 30020f8cda..faed3f2442 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3878,6 +3878,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 62023c20b2..07aeb43a7f 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4574,10 +4574,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
* tlists according to one of the children, and the first one is the most
* natural choice. Likewise special-case ModifyTable to pretend that the
* first child plan is the OUTER referent; this is to support RETURNING
- * lists containing references to non-target relations.
+ * lists containing references to non-target relations. For Append, use the
+ * explicitly specified referent.
*/
if (IsA(plan, Append))
- dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+ {
+ Append *app = (Append *) plan;
+ dpns->outer_plan = list_nth(app->appendplans, app->referent);
+ }
else if (IsA(plan, MergeAppend))
dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
else if (IsA(plan, ModifyTable))
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 237ca9fa30..27742a1641 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1416,7 +1416,7 @@ void
ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
{
/*
- * XXXX: There's no property to show as an identier of a wait event set,
+ * XXXX: There's no property to show as an identifier of a wait event set,
* use its pointer instead.
*/
if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
@@ -1431,7 +1431,7 @@ static void
PrintWESLeakWarning(WaitEventSet *events)
{
/*
- * XXXX: There's no property to show as an identier of a wait event set,
+ * XXXX: There's no property to show as an identifier of a wait event set,
* use its pointer instead.
*/
elog(WARNING, "wait event set leak: %p still referenced",
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+ void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+ long timeout);
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 415e117407..9cf2c1f676 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
#define EXEC_FLAG_MARK 0x0008 /* need mark/restore */
#define EXEC_FLAG_SKIP_TRIGGERS 0x0010 /* skip AfterTrigger calls */
#define EXEC_FLAG_WITH_NO_DATA 0x0020 /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC 0x0040 /* request async execution */
/* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data,
+ bool reinit);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
GetForeignPlan_function GetForeignPlan;
BeginForeignScan_function BeginForeignScan;
IterateForeignScan_function IterateForeignScan;
+ IterateForeignScan_function IterateForeignScanAsync;
ReScanForeignScan_function ReScanForeignScan;
EndForeignScan_function EndForeignScan;
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
InitializeDSMForeignScan_function InitializeDSMForeignScan;
ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
ShutdownForeignScan_function ShutdownForeignScan;
/* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
/* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ef448d67c7..dce7fb0e07 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -925,6 +925,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
* abstract superclass for all PlanState-type nodes.
* ----------------
*/
+typedef enum AsyncState
+{
+ AS_AVAILABLE,
+ AS_WAITING
+} AsyncState;
+
typedef struct PlanState
{
NodeTag type;
@@ -1013,6 +1019,11 @@ typedef struct PlanState
bool outeropsset;
bool inneropsset;
bool resultopsset;
+
+ /* Async subnode execution stuff */
+ AsyncState asyncstate;
+
+ int32 padding; /* to keep alignment of derived types */
} PlanState;
/* ----------------
@@ -1208,14 +1219,21 @@ struct AppendState
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
- int as_whichplan;
+ int as_whichsyncplan; /* which sync plan is being executed */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
+ int as_nasyncplans; /* # of async-capable children */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
- Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_syncsubplans;
bool (*choose_next_subplan) (AppendState *);
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ Bitmapset *as_pending_async; /* pending async plans */
+ TupleTableSlot **as_asyncresult; /* results of each async plan */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ bool as_exec_prune; /* runtime pruning needed for async exec? */
};
/* ----------------
@@ -1783,6 +1801,7 @@ typedef struct ForeignScanState
Size pscan_len; /* size of parallel coordination information */
/* use struct pointer to avoid including fdwapi.h here */
struct FdwRoutine *fdwroutine;
+ bool fs_async;
void *fdw_state; /* foreign-data wrapper can keep state here */
} ForeignScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..abad89b327 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous execution logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -262,6 +267,10 @@ typedef struct Append
/* Info for run-time subplan pruning; NULL if we're not doing that */
struct PartitionPruneInfo *part_prune_info;
+
+ /* Async child node execution stuff */
+ int nasyncplans; /* # async subplans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 10b6e81079..53876b2d8b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -241,4 +241,6 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+extern bool is_async_capable_path(Path *path);
+
#endif /* PATHS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..d673f9da6b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,7 +887,8 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
- WAIT_EVENT_XACT_GROUP_UPDATE
+ WAIT_EVENT_XACT_GROUP_UPDATE,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.18.4
Attachment: v7-0003-async-postgres_fdw.patch (text/x-patch; charset=us-ascii)
From 4df7f9b34ad8d9fd9b415459e2673ebe27f72343 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v7 3/3] async postgres_fdw
---
contrib/postgres_fdw/connection.c | 28 +
.../postgres_fdw/expected/postgres_fdw.out | 272 ++++----
contrib/postgres_fdw/postgres_fdw.c | 601 +++++++++++++++---
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +-
5 files changed, 710 insertions(+), 213 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 08daf26fdf..be5948f613 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -59,6 +59,7 @@ typedef struct ConnCacheEntry
bool invalidated; /* true if reconnect is pending */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -203,6 +204,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
entry->conn, server->servername, user->umid, user->userid);
+ entry->storage = NULL;
}
/*
@@ -216,6 +218,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
return entry->conn;
}
+/*
+ * Return the connection-specific storage for this user. Allocate it with
+ * initsize bytes if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ bool found;
+ ConnCacheEntry *entry;
+ ConnCacheKey key;
+
+ /* Find the storage using the same key as GetConnection */
+ key = user->umid;
+ entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+ Assert(found);
+
+ /* Create one if it doesn't exist yet. */
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
/*
* Connect to remote server using specified server and user mapping properties.
*/
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 10e23d02ed..0634ab9f6a 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6986,7 +6986,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7014,7 +7014,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7042,7 +7042,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7070,7 +7070,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7140,35 +7140,41 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7178,35 +7184,41 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7236,11 +7248,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-> Hash Join
Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
@@ -7254,12 +7267,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo foo_1
- Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ Async subplans: 1
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+ -> Seq Scan on public.foo foo_1
+ Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
select tableoid::regclass, * from bar order by 1,2;
@@ -7289,16 +7303,17 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
+ Output: ROW(foo_1.f1), foo_1.f1
+ Remote SQL: SELECT f1 FROM public.loct1
+ -> Async Foreign Scan on public.foo2 foo_3
+ Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+ Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
- Output: ROW(foo_1.f1), foo_1.f1
- Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
- Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
- Remote SQL: SELECT f1 FROM public.loct1
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -7316,17 +7331,18 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
+ Async subplans: 2
+ -> Async Foreign Scan on public.foo2 foo_1
+ Output: ROW(foo_1.f1), foo_1.f1
+ Remote SQL: SELECT f1 FROM public.loct1
+ -> Async Foreign Scan on public.foo2 foo_3
+ Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+ Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
- Output: ROW(foo_1.f1), foo_1.f1
- Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
- Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
- Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+(47 rows)
update bar set f2 = f2 + 100
from
@@ -7476,27 +7492,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2 bar_1
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2 bar_1
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2 bar_1
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2 bar_1
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
@@ -8571,11 +8593,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a % 25 =0 ORDER BY 1,2,3;
a | b | c
@@ -8610,20 +8633,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
wr | wr
@@ -8652,11 +8677,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ Async subplans: 2
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
a | b
@@ -8709,21 +8735,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
-- test FOR UPDATE; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
- QUERY PLAN
---------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------
LockRows
-> Sort
Sort Key: t1.a
-> Hash Join
Hash Cond: (t2.b = t1.a)
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
-> Hash
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+ Async subplans: 2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
a | b
@@ -8758,18 +8786,19 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+ Async subplans: 3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
-- Plan with partitionwise aggregates is enabled
SET enable_partitionwise_aggregate TO true;
@@ -8780,13 +8809,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ Async subplans: 3
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
a | sum | min | count
@@ -8808,29 +8838,22 @@ SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
Sort
Output: t1.a, (count(((t1.*)::pagg_tab)))
Sort Key: t1.a
- -> Append
- -> HashAggregate
- Output: t1.a, count(((t1.*)::pagg_tab))
- Group Key: t1.a
- Filter: (avg(t1.b) < '22'::numeric)
- -> Foreign Scan on public.fpagg_tab_p1 t1
- Output: t1.a, t1.*, t1.b
- Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
- -> HashAggregate
- Output: t1_1.a, count(((t1_1.*)::pagg_tab))
- Group Key: t1_1.a
- Filter: (avg(t1_1.b) < '22'::numeric)
- -> Foreign Scan on public.fpagg_tab_p2 t1_1
+ -> HashAggregate
+ Output: t1.a, count(((t1.*)::pagg_tab))
+ Group Key: t1.a
+ Filter: (avg(t1.b) < '22'::numeric)
+ -> Append
+ Async subplans: 3
+ -> Async Foreign Scan on public.fpagg_tab_p1 t1_1
Output: t1_1.a, t1_1.*, t1_1.b
- Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
- -> HashAggregate
- Output: t1_2.a, count(((t1_2.*)::pagg_tab))
- Group Key: t1_2.a
- Filter: (avg(t1_2.b) < '22'::numeric)
- -> Foreign Scan on public.fpagg_tab_p3 t1_2
+ Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
+ -> Async Foreign Scan on public.fpagg_tab_p2 t1_2
Output: t1_2.a, t1_2.*, t1_2.b
+ Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
+ -> Async Foreign Scan on public.fpagg_tab_p3 t1_3
+ Output: t1_3.a, t1_3.*, t1_3.b
Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
-(25 rows)
+(18 rows)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
a | count
@@ -8850,20 +8873,15 @@ SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700
-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.b
- -> Finalize HashAggregate
+ -> HashAggregate
Group Key: pagg_tab.b
Filter: (sum(pagg_tab.a) < 700)
-> Append
- -> Partial HashAggregate
- Group Key: pagg_tab.b
- -> Foreign Scan on fpagg_tab_p1 pagg_tab
- -> Partial HashAggregate
- Group Key: pagg_tab_1.b
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_1
- -> Partial HashAggregate
- Group Key: pagg_tab_2.b
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_2
-(15 rows)
+ Async subplans: 3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
-- ===================================================================
-- access rights and superuser
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index a31abce7c9..14824368cc 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -35,6 +37,7 @@
#include "optimizer/restrictinfo.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "postgres_fdw.h"
#include "utils/builtins.h"
#include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -122,11 +128,29 @@ enum FdwDirectModifyPrivateIndex
FdwDirectModifyPrivateSetProcessed
};
+/*
+ * Connection common state - shared among all PgFdwState instances using the
+ * same connection.
+ */
+typedef struct PgFdwConnCommonState
+{
+ ForeignScanState *leader; /* leader node of this connection */
+ bool busy; /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnCommonState *commonstate; /* connection common state */
+} PgFdwState;
+
/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +161,6 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool async; /* true if run asynchronously */
+ bool queued; /* true if this node is in waiter queue */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* last element in waiter queue.
+ * valid only on the leader node */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+ WaitEventSet *wes,
+ void *caller_data, bool reinit);
/*
* Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
PG_RETURN_POINTER(routine);
}
@@ -1433,12 +1475,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+ fsstate->s.commonstate->leader = NULL;
+ fsstate->s.commonstate->busy = false;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->async = false;
+ fsstate->queued = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1486,40 +1538,241 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_values);
}
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue the node if it isn't in the queue already. If the underlying
+ * connection is not busy, send the request immediately instead.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+ /*
+ * Do nothing if the node is already in the queue or has already reached
+ * EOF. Note that the leader node is not marked as queued.
+ */
+ if (leader == node || fsstate->queued || fsstate->eof_reached)
+ return;
+
+ if (leader == NULL)
+ {
+ /* no leader means not busy, send request immediately */
+ request_more_data(node);
+ }
+ else
+ {
+ /* the connection is busy, queue the node */
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+ PgFdwScanState *last_waiter_state
+ = GetPgFdwScanState(leader_state->last_waiter);
+
+ last_waiter_state->waiter = node;
+ leader_state->last_waiter = node;
+ fsstate->queued = true;
+ }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Make the first waiter the next leader.
+ * Returns the new leader, or NULL if there's no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+ PgFdwScanState *leader_state = GetPgFdwScanState(node);
+ ForeignScanState *next_leader = leader_state->waiter;
+
+ Assert(leader_state->s.commonstate->leader == node);
+
+ if (next_leader)
+ {
+ /* the first waiter becomes the next leader */
+ PgFdwScanState *next_leader_state = GetPgFdwScanState(next_leader);
+ next_leader_state->last_waiter = leader_state->last_waiter;
+ next_leader_state->queued = false;
+ }
+
+ leader_state->waiter = NULL;
+ leader_state->s.commonstate->leader = next_leader;
+
+ return next_leader;
+}
+
+/*
+ * Remove the node from the waiter queue.
+ *
+ * Remaining results are cleared if the node is a busy leader.
+ * This is intended to be used during node shutdown.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state;
+ ForeignScanState *prev;
+ PgFdwScanState *prev_state;
+ ForeignScanState *cur;
+
+ /* nothing to do if this node is neither the leader nor queued */
+ if (!leader || (leader != node && !fsstate->queued))
+ return;
+
+ leader_state = GetPgFdwScanState(leader);
+
+ if (leader == node)
+ {
+ if (leader_state->s.commonstate->busy)
+ {
+ /*
+ * This node is waiting for a result; absorb it first so that
+ * subsequent commands can be sent on the connection.
+ */
+ /* leader_state already points at this node's state; see above */
+ PGconn *conn = leader_state->s.conn;
+
+ while (PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+
+ leader_state->s.commonstate->busy = false;
+ }
+
+ move_to_next_waiter(node);
+
+ return;
+ }
+
+ /*
+ * Just remove the node from the queue.
+ *
+ * Nodes don't have a link to the previous node, but this function is only
+ * called on the shutdown path, so we don't bother looking for a faster
+ * way to do this.
+ */
+ prev = leader;
+ prev_state = leader_state;
+ cur = GetPgFdwScanState(prev)->waiter;
+ while (cur)
+ {
+ PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+ if (cur == node)
+ {
+ prev_state->waiter = curstate->waiter;
+
+ /* relink to the previous node if the last node was removed */
+ if (leader_state->last_waiter == cur)
+ leader_state->last_waiter = prev;
+
+ fsstate->queued = false;
+
+ return;
+ }
+ prev = cur;
+ prev_state = curstate;
+ cur = curstate->waiter;
+ }
+}
+
/*
* postgresIterateForeignScan
- * Retrieve next row from the result set, or clear tuple slot to indicate
- * EOF.
+ * Retrieve the next row from the result set.
+ *
+ * For synchronous nodes, returning a cleared tuple slot means EOF.
+ *
+ * For asynchronous nodes, if a cleared tuple slot is returned, the caller
+ * needs to check the async state to tell whether all tuples have been
+ * received (AS_AVAILABLE) or more data is expected (AS_WAITING).
*/
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- /*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
- * Get some more tuples, if we've run out.
- */
+ if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+ {
+ /* we've run out, get some more tuples */
+ if (!node->fs_async)
+ {
+ /*
+ * Finish the running query before sending the next command for
+ * this node.
+ */
+ if (!fsstate->s.commonstate->busy)
+ vacate_connection((PgFdwState *)fsstate, false);
+
+ request_more_data(node);
+
+ /* Fetch the result immediately. */
+ fetch_received_data(node);
+ }
+ else if (!fsstate->s.commonstate->busy)
+ {
+ /* If the connection is not busy, just send the request. */
+ request_more_data(node);
+ }
+ else
+ {
+ /* The connection is busy, queue the request */
+ bool available = true;
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+ /* queue the requested node */
+ add_async_waiter(node);
+
+ /*
+ * The request for the next node cannot be sent before the leader
+ * responds. Finish the current leader if possible.
+ */
+ if (PQisBusy(leader_state->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT |
+ WL_EXIT_ON_PM_DEATH,
+ PQsocket(leader_state->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (!(rc & WL_SOCKET_READABLE))
+ available = false;
+ }
+
+ /* fetch the leader's data and re-enqueue it for its next request */
+ if (available)
+ {
+ fetch_received_data(leader);
+ add_async_waiter(leader);
+ }
+ }
+ }
+
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
- if (fsstate->next_tuple >= fsstate->num_tuples)
- return ExecClearTuple(slot);
+ /*
+ * We haven't received a result for the given node this time; return
+ * with no tuple to give way to another node.
+ */
+ if (fsstate->eof_reached)
+ node->ss.ps.asyncstate = AS_AVAILABLE;
+ else
+ node->ss.ps.asyncstate = AS_WAITING;
+
+ return ExecClearTuple(slot);
}
/*
* Return the next tuple.
*/
+ node->ss.ps.asyncstate = AS_AVAILABLE;
ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
false);
@@ -1534,7 +1787,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1542,6 +1795,8 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ vacate_connection((PgFdwState *)fsstate, true);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1570,9 +1825,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1590,7 +1845,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1598,15 +1853,31 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
+/*
+ * postgresShutdownForeignScan
+ * Remove async state and clean up leftover data on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* remove the node from waiting queue */
+ remove_async_node(node);
+}
+
/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
@@ -2371,7 +2642,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2456,7 +2729,11 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ /* finish running query to send my command */
+ vacate_connection((PgFdwState *)dmstate, true);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2503,8 +2780,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2702,6 +2979,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnCommonState *commonstate;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2746,6 +3024,18 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ commonstate = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnCommonState));
+ if (commonstate)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.commonstate = commonstate;
+
+ /* finish running query to send my command */
+ vacate_connection(&tmpstate, true);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3316,11 +3606,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -3383,50 +3673,119 @@ create_cursor(ForeignScanState *node)
}
/*
- * Fetch some more rows from the node's cursor.
+ * Send the node's next request. If the given node is different from the
+ * current connection leader, push the leader back to the waiter queue and
+ * let the given node become the leader.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *leader = fsstate->s.commonstate->leader;
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* must be non-busy */
+ Assert(!fsstate->s.commonstate->busy);
+ /* must be not-eof'ed */
+ Assert(!fsstate->eof_reached);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.commonstate->busy = true;
+
+ /* The node is the current leader, just return. */
+ if (leader == node)
+ return;
+
+ /* Let the node be the leader */
+ if (leader != NULL)
+ {
+ remove_async_node(node);
+ fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+ fsstate->waiter = leader;
+ }
+ else
+ {
+ fsstate->last_waiter = node;
+ fsstate->waiter = NULL;
+ }
+
+ fsstate->s.commonstate->leader = node;
+}
+
+/*
+ * Fetch received data and automatically send the request of the next waiter.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ ForeignScanState *waiter;
+
+ /* I should be the current connection leader */
+ Assert(fsstate->s.commonstate->leader == node);
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuple is remaining
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* Some tuples remain. Move them to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
- char sql[64];
- int numrows;
+ PGconn *conn = fsstate->s.conn;
+ int addrows;
+ size_t newsize;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
-
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
+ res = pgfdw_get_result(conn, fsstate->query);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3436,22 +3795,73 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
}
PG_FINALLY();
{
+ fsstate->s.commonstate->busy = false;
+
if (res)
PQclear(res);
}
PG_END_TRY();
+ /* let the first waiter be the next leader of this connection */
+ waiter = move_to_next_waiter(node);
+
+ /* send the next request if any */
+ if (waiter)
+ request_more_data(waiter);
+
MemoryContextSwitchTo(oldcontext);
}
+/*
+ * Vacate the underlying connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+ PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+ ForeignScanState *leader;
+
+ Assert(commonstate != NULL);
+
+ /* just return if the connection is already available */
+ if (commonstate->leader == NULL || !commonstate->busy)
+ return;
+
+ /*
+ * let the current connection leader read all of the result for the running
+ * query
+ */
+ leader = commonstate->leader;
+ fetch_received_data(leader);
+
+ /* let the first waiter be the next leader of this connection */
+ move_to_next_waiter(leader);
+
+ if (!clear_queue)
+ return;
+
+ /* Clear the waiting list */
+ while (leader)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+ fsstate->last_waiter = NULL;
+ leader = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
@@ -3565,7 +3975,9 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.commonstate = (PgFdwConnCommonState *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -3652,6 +4064,9 @@ execute_foreign_modify(EState *estate,
operation == CMD_UPDATE ||
operation == CMD_DELETE);
+ /* finish running query to send my command */
+ vacate_connection((PgFdwState *)fmstate, true);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -3679,14 +4094,14 @@ execute_foreign_modify(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3694,10 +4109,10 @@ execute_foreign_modify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -3733,7 +4148,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3743,12 +4158,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3756,9 +4171,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3887,16 +4302,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -4055,9 +4470,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -4065,10 +4480,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -5559,6 +5974,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add the wait event that the ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* Reinit is not supported for now. */
+ Assert(reinit);
+
+ if (fsstate->s.commonstate->leader == node)
+ {
+ AddWaitEventToSet(wes,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, caller_data);
+ return true;
+ }
+
+ return false;
+}
+
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation, for use while EXPLAINing ForeignScan. It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 78156d10b4..17d461b1a4 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1799,25 +1799,25 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1859,12 +1859,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1923,8 +1923,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
-- Test that UPDATE/DELETE with inherited target works with row-level triggers
CREATE TRIGGER trig_row_before
--
2.18.4
On Tue, Sep 29, 2020 at 4:45 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
BTW: I noticed that you changed the ExecProcNode() API so that an
Append calling FDWs can know whether they return tuples immediately or
not:
That is, 1) in postgresIterateForeignScan() postgres_fdw sets the new
PlanState’s flag asyncstate to AS_AVAILABLE/AS_WAITING depending on
whether it returns a tuple immediately or not, and then 2) the Append
knows that from the new flag when the callback routine returns. I’m
not sure this is a good idea, because it seems likely that the
ExecProcNode() change would affect many other places in the executor,
making maintenance and/or future development difficult. I think the
FDW callback routines proposed in the original patch by Robert would
provide a cleaner way to do asynchronous execution of FDWs without
changing the ExecProcNode() API, IIUC:

+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.

What is the reason for not doing like this in your patch?
I think we should avoid changing the ExecProcNode() API.
Thomas’ patch also provides a clean FDW API that doesn’t change the
ExecProcNode() API, but I think the FDW API provided in Robert's patch
would be better designed, because I think it would support more
different types of asynchronous interaction between the core and FDWs.
Consider this bit from Thomas’ patch, which produces a tuple when a
file descriptor becomes ready:
+ if (event.events & WL_SOCKET_READABLE)
+ {
+ /* Linear search for the node that told us to wait for this fd. */
+ for (i = 0; i < node->nasyncplans; ++i)
+ {
+ if (event.fd == node->asyncfds[i])
+ {
+ TupleTableSlot *result;
+
+ /*
+ --> * We assume that because the fd is ready, it can produce
+ --> * a tuple now, which is not perfect. An improvement
+ --> * would be if it could say 'not yet, I'm still not
+ --> * ready', so eg postgres_fdw could PQconsumeInput and
+ --> * then say 'I need more input'.
+ */
+ result = ExecProcNode(node->asyncplans[i]);
+ if (!TupIsNull(result))
+ {
+ /*
+ * Remember this plan so that append_next_async will
+ * keep trying this subplan first until it stops
+ * feeding us buffered tuples.
+ */
+ node->lastreadyplan = i;
+ /* We can stop waiting for this fd. */
+ node->asyncfds[i] = 0;
+ return result;
+ }
+ else
+ {
+ /*
+ * This subplan has reached EOF. We'll go back and
+ * wait for another one.
+ */
+ forget_async_subplan(node, i);
+ break;
+ }
+ }
+ }
+ }
As commented above, his patch doesn’t allow an FDW to do another data
fetch from the remote side before returning a tuple when the file
descriptor becomes available, but Robert’s patch would, using his FDW
API ForeignAsyncNotify(), which is called when the file descriptor
becomes available, IIUC.
I might be missing something, but I feel inclined to vote for Robert’s
patch (more precisely, Robert’s patch as a base patch with (1) some
planner/executor changes from Horiguchi-san’s patch and (2)
postgres_fdw changes from Thomas’ patch adjusted to match Robert’s FDW
API).
Best regards,
Etsuro Fujita
At Fri, 2 Oct 2020 09:00:53 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Tue, Sep 29, 2020 at 4:45 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
BTW: I noticed that you changed the ExecProcNode() API so that an
Append calling FDWs can know whether they return tuples immediately or
not:

That is, 1) in postgresIterateForeignScan() postgres_fdw sets the new
PlanState’s flag asyncstate to AS_AVAILABLE/AS_WAITING depending on
whether it returns a tuple immediately or not, and then 2) the Append
knows that from the new flag when the callback routine returns. I’m
not sure this is a good idea, because it seems likely that the
ExecProcNode() change would affect many other places in the executor,
making maintenance and/or future development difficult. I think the
FDW callback routines proposed in the original patch by Robert would
provide a cleaner way to do asynchronous execution of FDWs without
changing the ExecProcNode() API, IIUC:

+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.

What is the reason for not doing like this in your patch?
I think we should avoid changing the ExecProcNode() API.
Thomas’ patch also provides a clean FDW API that doesn’t change the
ExecProcNode() API, but I think the FDW API provided in Robert's patch
Could you explain what "change" you are referring to?
I have made many changes to reduce performance inpact on existing
paths (before the current PlanState.ExecProcNode was introduced.) So
large part of my changes could be actually reverted.
would be better designed, because I think it would support more
different types of asynchronous interaction between the core and FDWs.
Consider this bit from Thomas’ patch, which produces a tuple when a
file descriptor becomes ready:

+ if (event.events & WL_SOCKET_READABLE)
+ {
+ /* Linear search for the node that told us to wait for this fd. */
+ for (i = 0; i < node->nasyncplans; ++i)
+ {
+ if (event.fd == node->asyncfds[i])
+ {
+ TupleTableSlot *result;
+
+ /*
+ --> * We assume that because the fd is ready, it can produce
+ --> * a tuple now, which is not perfect. An improvement
+ --> * would be if it could say 'not yet, I'm still not
+ --> * ready', so eg postgres_fdw could PQconsumeInput and
+ --> * then say 'I need more input'.
+ */
+ result = ExecProcNode(node->asyncplans[i]);
..
As commented above, his patch doesn’t allow an FDW to do another data
fetch from the remote side before returning a tuple when the file
descriptor becomes available, but Robert’s patch would, using his FDW
API ForeignAsyncNotify(), which is called when the file descriptor
becomes available, IIUC.

I might be missing something, but I feel inclined to vote for Robert’s
patch (more precisely, Robert’s patch as a base patch with (1) some
planner/executor changes from Horiguchi-san’s patch and (2)
postgres_fdw changes from Thomas’ patch adjusted to match Robert’s FDW
API).
I'm not sure what you have in mind from the description above. Could
you please elaborate?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Oct 2, 2020 at 3:39 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Fri, 2 Oct 2020 09:00:53 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
I think we should avoid changing the ExecProcNode() API.
Could you explain what "change" you are referring to?
It’s the contract of the ExecProcNode() API: if the result is NULL or
an empty slot, there is nothing more to do. You changed it to
something like this: “even if the result is NULL or an empty slot,
there might be something more to do if AS_WAITING, so please wait in
that case”. That seems pretty invasive to me.
I might be missing something, but I feel inclined to vote for Robert’s
patch (more precisely, Robert’s patch as a base patch with (1) some
planner/executor changes from Horiguchi-san’s patch and (2)
postgres_fdw changes from Thomas’ patch adjusted to match Robert’s FDW
API).I'm not sure what you have in mind from the description above. Could
you please elaborate?
Sorry, my explanation was not enough.
You made lots of changes to the original patch by Robert, but I don’t
think those changes are all good; 1) as for the core part, you changed
his patch so that FDWs can interact with the core at execution time,
only through the ForeignAsyncConfigureWait() API, but that resulted in
an invasive change to the ExecProcNode() API as mentioned above, and
2) as for the postgres_fdw part, you changed it so that postgres_fdw
can handle concurrent data fetches from multiple foreign scan nodes
using the same connection, but that would cause a performance issue
that I mentioned in [1].
So I think it would be better to use his patch rather as proposed
except for the postgres_fdw part and Thomas’ patch as a base patch for
that part. As for your patch, I think we could use some part of it as
improvements. One thing is the planner/executor changes that lead to
the improved efficiency discussed in [2][3]. Another would be to have
a separate ExecAppend() function for this feature like your patch to
avoid a performance penalty in the case of a plain old Append that
involves no FDWs with asynchronism optimization, if necessary. I also
think we could probably use the WaitEventSet-related changes in your
patch (i.e., the 0001 patch).
Does that answer your question?
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx+3dyivCT=A@mail.gmail.com
[2]: /messages/by-id/CAPmGK16+y8mEX9AT1LXVLksbTyDnYWZXm0uDxZ8bza153Wey9A@mail.gmail.com
[3]: /messages/by-id/CAPmGK14AjvCd9QuoRQ-ATyExA_SiVmGFGstuqAKSzZ7JDJTBVg@mail.gmail.com
At Sun, 4 Oct 2020 18:36:05 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Fri, Oct 2, 2020 at 3:39 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:At Fri, 2 Oct 2020 09:00:53 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
I think we should avoid changing the ExecProcNode() API.
Could you explain what "change" you are referring to?
Thank you for the explanation.
It’s the contract of the ExecProcNode() API: if the result is NULL or
an empty slot, there is nothing more to do. You changed it to
something like this: “even if the result is NULL or an empty slot,
there might be something more to do if AS_WAITING, so please wait in
that case”. That seems pretty invasive to me.
Yeah, it's "invasive", as I intended. I thought that the async-aware
and async-capable nodes should interact using a channel defined as
part of the ExecProcNode API. It was aiming for increased affinity
with a push-up executor framework.
Since the current direction is to commit this feature as an
intermediate or tentative implementation, it sounds reasonable to
avoid such a change.
I might be missing something, but I feel inclined to vote for Robert’s
patch (more precisely, Robert’s patch as a base patch with (1) some
planner/executor changes from Horiguchi-san’s patch and (2)
postgres_fdw changes from Thomas’ patch adjusted to match Robert’s FDW
API).I'm not sure what you have in mind from the description above. Could
you please elaborate?

Sorry, my explanation was not enough.
You made lots of changes to the original patch by Robert, but I don’t
think those changes are all good; 1) as for the core part, you changed
his patch so that FDWs can interact with the core at execution time,
only through the ForeignAsyncConfigureWait() API, but that resulted in
an invasive change to the ExecProcNode() API as mentioned above, and
2) as for the postgres_fdw part, you changed it so that postgres_fdw
can handle concurrent data fetches from multiple foreign scan nodes
using the same connection, but that would cause a performance issue
that I mentioned in [1].
(Putting aside the bug itself...)

Yeah, I noticed such a possibility of fetch cascading; however, I
think the situation the feature is intended for is more common than
the problem case.

That being said, I agree that it is a candidate to rip out when we
are trying to reduce the footprint of this patch.
So I think it would be better to use his patch rather as proposed
except for the postgres_fdw part and Thomas’ patch as a base patch for
that part. As for your patch, I think we could use some part of it as
improvements. One thing is the planner/executor changes that lead to
the improved efficiency discussed in [2][3]. Another would be to have
a separate ExecAppend() function for this feature like your patch to
avoid a performance penalty in the case of a plain old Append that
involves no FDWs with asynchronism optimization, if necessary. I also
think we could probably use the WaitEventSet-related changes in your
patch (i.e., the 0001 patch).

Does that answer your question?
Yes, thanks. My comments on the direction are as above. Are
you going to continue working on this patch?
[1] /messages/by-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx+3dyivCT=A@mail.gmail.com
[2] /messages/by-id/CAPmGK16+y8mEX9AT1LXVLksbTyDnYWZXm0uDxZ8bza153Wey9A@mail.gmail.com
[3] /messages/by-id/CAPmGK14AjvCd9QuoRQ-ATyExA_SiVmGFGstuqAKSzZ7JDJTBVg@mail.gmail.com
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Oct 5, 2020 at 1:30 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Sun, 4 Oct 2020 18:36:05 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
It’s the contract of the ExecProcNode() API: if the result is NULL or
an empty slot, there is nothing more to do. You changed it to
something like this: “even if the result is NULL or an empty slot,
there might be something more to do if AS_WAITING, so please wait in
that case”. That seems pretty invasive to me.
Yeah, it's "invasive", as I intended. I thought that the async-aware
and async-capable nodes should interact through a channel defined as
part of the ExecProcNode() API. That was aimed at an increased
affinity with a push-up executor framework.
Since the current direction is to commit this feature as an
intermediate or tentative implementation, it sounds reasonable to
avoid such a change.
OK. (Actually, I'm wondering if we could extend this to the case
where an Append is indirectly on top of a foreign scan node without
changing the ExecProcNode() API.)
You made lots of changes to the original patch by Robert, but I don’t
think those changes are all good; 1) as for the core part, you changed
his patch so that FDWs can interact with the core at execution time,
only through the ForeignAsyncConfigureWait() API, but that resulted in
an invasive change to the ExecProcNode() API as mentioned above, and
2) as for the postgres_fdw part, you changed it so that postgres_fdw
can handle concurrent data fetches from multiple foreign scan nodes
using the same connection, but that would cause a performance issue
that I mentioned in [1].
Yeah, I noticed the possibility of fetch cascading; however, I think
the situation this feature is intended for is more common than the
problematic case.
I think a cleaner solution to that would be to support multiple
connections to the remote server...
So I think it would be better to use his patch rather as proposed
except for the postgres_fdw part and Thomas’ patch as a base patch for
that part. As for your patch, I think we could use some part of it as
improvements. One thing is the planner/executor changes that lead to
the improved efficiency discussed in [2][3]. Another would be to have
a separate ExecAppend() function for this feature like your patch to
avoid a performance penalty in the case of a plain old Append that
involves no FDWs with asynchronism optimization, if necessary. I also
think we could probably use the WaitEventSet-related changes in your
patch (i.e., the 0001 patch).
Does that answer your question?
Yes, thanks. My comments about the direction are as above. Are
you going to continue working on this patch?
Yes, if there are no objections from you or Thomas or Robert or anyone
else, I'll update Robert's patch as such.
Best regards,
Etsuro Fujita
On 10/5/20 11:35 AM, Etsuro Fujita wrote:
Hi,
I found a small problem: if we have a mix of async and sync subplans,
we hit an assertion failure on a busy connection. Just for example:
PLAN
====
Nested Loop (cost=100.00..174316.95 rows=975 width=8) (actual
time=5.191..9.262 rows=9 loops=1)
Join Filter: (frgn.a = l.a)
Rows Removed by Join Filter: 8991
-> Append (cost=0.00..257.20 rows=11890 width=4) (actual
time=0.419..2.773 rows=1000 loops=1)
Async subplans: 4
-> Async Foreign Scan on f_1 l_2 (cost=100.00..197.75
rows=2925 width=4) (actual time=0.381..0.585 rows=211 loops=1)
-> Async Foreign Scan on f_2 l_3 (cost=100.00..197.75
rows=2925 width=4) (actual time=0.005..0.206 rows=195 loops=1)
-> Async Foreign Scan on f_3 l_4 (cost=100.00..197.75
rows=2925 width=4) (actual time=0.003..0.282 rows=187 loops=1)
-> Async Foreign Scan on f_4 l_5 (cost=100.00..197.75
rows=2925 width=4) (actual time=0.003..0.316 rows=217 loops=1)
-> Seq Scan on l_0 l_1 (cost=0.00..2.90 rows=190 width=4)
(actual time=0.017..0.057 rows=190 loops=1)
-> Materialize (cost=100.00..170.94 rows=975 width=4) (actual
time=0.001..0.002 rows=9 loops=1000)
-> Foreign Scan on frgn (cost=100.00..166.06 rows=975
width=4) (actual time=0.766..0.768 rows=9 loops=1)
See the reproduction script 'test1.sql' in the attachment. Here I
force reproduction of the problem by setting enable_hashjoin and
enable_mergejoin to off.
'asyncmix.patch' contains my solution to this problem.
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
asyncmix.patchtext/x-patch; charset=UTF-8; name=asyncmix.patchDownload
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 14824368cc..613d406982 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -455,7 +455,7 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
void *arg);
static void create_cursor(ForeignScanState *node);
static void request_more_data(ForeignScanState *node);
-static void fetch_received_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node, bool vacateconn);
static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
@@ -1706,15 +1706,19 @@ postgresIterateForeignScan(ForeignScanState *node)
{
/*
* finish the running query before sending the next command for
- * this node
+ * this node.
+ * When the plan contains both asynchronous subplans and non-async
+ * subplans backend could request more data in async mode and want to
+ * get data in sync mode by the same connection. Here it must wait
+ * for async data before request another.
*/
- if (!fsstate->s.commonstate->busy)
- vacate_connection((PgFdwState *)fsstate, false);
+ if (fsstate->s.commonstate->busy)
+ vacate_connection(&fsstate->s, false);
request_more_data(node);
/* Fetch the result immediately. */
- fetch_received_data(node);
+ fetch_received_data(node, false);
}
else if (!fsstate->s.commonstate->busy)
{
@@ -1749,7 +1753,7 @@ postgresIterateForeignScan(ForeignScanState *node)
/* fetch the leader's data and enqueue it for the next request */
if (available)
{
- fetch_received_data(leader);
+ fetch_received_data(leader, false);
add_async_waiter(leader);
}
}
@@ -3729,7 +3733,7 @@ request_more_data(ForeignScanState *node)
* Fetches received data and automatically send requests of the next waiter.
*/
static void
-fetch_received_data(ForeignScanState *node)
+fetch_received_data(ForeignScanState *node, bool vacateconn)
{
PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
@@ -3817,7 +3821,8 @@ fetch_received_data(ForeignScanState *node)
waiter = move_to_next_waiter(node);
/* send the next request if any */
- if (waiter)
+ if (waiter && (!vacateconn ||
+ GetPgFdwScanState(node)->s.conn != GetPgFdwScanState(waiter)->s.conn))
request_more_data(waiter);
MemoryContextSwitchTo(oldcontext);
@@ -3843,7 +3848,7 @@ vacate_connection(PgFdwState *fdwstate, bool clear_queue)
* query
*/
leader = commonstate->leader;
- fetch_received_data(leader);
+ fetch_received_data(leader, true);
/* let the first waiter be the next leader of this connection */
move_to_next_waiter(leader);
Hi,
I want to suggest one more improvement. Currently the
is_async_capable_path() routine allows only ForeignPath nodes as
async-capable paths. But in some cases we can allow SubqueryScanPath
as async-capable too.
For example:
SELECT * FROM ((SELECT * FROM foreign_1)
UNION ALL
(SELECT a FROM foreign_2)) AS b;
is async capable, but:
SELECT * FROM ((SELECT * FROM foreign_1 LIMIT 10)
UNION ALL
(SELECT a FROM foreign_2 LIMIT 10)) AS b;
is not async-capable.
The attached patch tries to improve this situation.
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
0001-async_capable_subqueries.patchtext/x-patch; charset=UTF-8; name=0001-async_capable_subqueries.patchDownload
From fa73b84e8c456c48ef4788304d2ed14f31365aac Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Date: Thu, 8 Oct 2020 15:46:41 +0500
Subject: [PATCH] 2
---
contrib/postgres_fdw/expected/postgres_fdw.out | 3 ++-
src/backend/optimizer/path/allpaths.c | 4 ++++
src/backend/optimizer/plan/createplan.c | 2 +-
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index eca44a4f40..e5972574b6 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -2082,6 +2082,7 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2
Output: t1.c1, t2.c1
Group Key: t1.c1, t2.c1
-> Append
+ Async subplans: 2
-> Foreign Scan
Output: t1.c1, t2.c1
Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
@@ -2090,7 +2091,7 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2
Output: t1_1.c1, t2_1.c1
Relations: (public.ft1 t1_1) INNER JOIN (public.ft2 t2_1)
Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1"))))
-(20 rows)
+(21 rows)
SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) UNION SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)) AS t (t1c1, t2c1) GROUP BY t1c1 ORDER BY t1c1 OFFSET 100 LIMIT 10;
t1c1 | avg
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 17e9a7a897..5822ba83e0 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3991,6 +3991,10 @@ is_async_capable_path(Path *path)
fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
return true;
}
+ break;
+ case T_SubqueryScanPath:
+ if (is_async_capable_path(((SubqueryScanPath *) path)->subpath))
+ return true;
default:
break;
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 3ae46ed6f1..efb1b0cb4e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1231,7 +1231,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
* Classify as async-capable or not. If we have decided to run the
* children in parallel, we cannot any one of them run asynchronously.
* Planner thinks that all subnodes are executed in order if this
- * append is orderd. No subpaths cannot be run asynchronously in that
+ * append is ordered. No subpaths cannot be run asynchronously in that
* case.
*/
if (pathkeys == NIL &&
--
2.17.1
On Thu, Oct 8, 2020 at 6:39 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
I found a small problem: if we have a mix of async and sync subplans,
we hit an assertion failure on a busy connection. Just for example:
PLAN
====
Nested Loop (cost=100.00..174316.95 rows=975 width=8) (actual
time=5.191..9.262 rows=9 loops=1)
Join Filter: (frgn.a = l.a)
Rows Removed by Join Filter: 8991
-> Append (cost=0.00..257.20 rows=11890 width=4) (actual
time=0.419..2.773 rows=1000 loops=1)
Async subplans: 4
-> Async Foreign Scan on f_1 l_2 (cost=100.00..197.75
rows=2925 width=4) (actual time=0.381..0.585 rows=211 loops=1)
-> Async Foreign Scan on f_2 l_3 (cost=100.00..197.75
rows=2925 width=4) (actual time=0.005..0.206 rows=195 loops=1)
-> Async Foreign Scan on f_3 l_4 (cost=100.00..197.75
rows=2925 width=4) (actual time=0.003..0.282 rows=187 loops=1)
-> Async Foreign Scan on f_4 l_5 (cost=100.00..197.75
rows=2925 width=4) (actual time=0.003..0.316 rows=217 loops=1)
-> Seq Scan on l_0 l_1 (cost=0.00..2.90 rows=190 width=4)
(actual time=0.017..0.057 rows=190 loops=1)
-> Materialize (cost=100.00..170.94 rows=975 width=4) (actual
time=0.001..0.002 rows=9 loops=1000)
-> Foreign Scan on frgn (cost=100.00..166.06 rows=975
width=4) (actual time=0.766..0.768 rows=9 loops=1)
Actually I also found a similar issue before [1]. But in the first
place I'm not sure the way of handling concurrent data fetches by
multiple ForeignScan nodes using the same connection in postgres_fdw
implemented in Horiguchi-san's patch would be really acceptable,
because that would impact performance *negatively* in some cases as
mentioned in [1]. So I feel inclined to just disable this feature in
problematic cases including the above one in the first cut. Even with
such a limitation, I think it would be useful, because it would cover
typical use cases such as partitionwise joins and partitionwise
aggregates.
Thanks for the report!
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx+3dyivCT=A@mail.gmail.com
On Thu, Oct 8, 2020 at 8:40 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
I want to suggest one more improvement. Currently the
is_async_capable_path() routine allows only ForeignPath nodes as
async-capable paths. But in some cases we can allow SubqueryScanPath
as async-capable too.
The patch in attachment tries to improve this situation.
Seems like a good idea. Will look at the patch in detail.
Best regards,
Etsuro Fujita
On Mon, Oct 5, 2020 at 3:35 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Yes, if there are no objections from you or Thomas or Robert or anyone
else, I'll update Robert's patch as such.
Here is a new version of the patch (as promised in the developer
unconference in PostgresConf.CN & PGConf.Asia 2020):
* In Robert's patch [1] (and Horiguchi-san's, which was created based
on Robert's), ExecAppend() was modified to retrieve tuples from
async-aware children *before* the tuples will be needed, but I don't
think that's really a good idea, because the query might complete
before returning the tuples. So I modified that function so that a
tuple is retrieved from an async-aware child *when* it is needed, like
Thomas' patch. I used FDW callback functions proposed by Robert, but
introduced another FDW callback function ForeignAsyncBegin() for each
async-aware child to start an asynchronous data fetch at the first
call to ExecAppend() after ExecInitAppend() or ExecReScanAppend().
* For EvalPlanQual, I modified the patch so that async-aware children
are treated as if they were synchronous when executing EvalPlanQual.
* In Robert's patch, all async-aware children below Append nodes in
the query waiting for events to occur were managed by a single EState,
but I modified the patch so that such children are managed by each
Append node, like Horiguchi-san's patch and Thomas'.
* In Robert's patch, the FDW callback function
ForeignAsyncConfigureWait() allowed multiple events to be configured,
but I limited that function to only allow a single event to be
configured, just for simplicity.
* I haven't yet added some planner/resowner changes from Horiguchi-san's patch.
* I haven't yet done anything about the issue on postgres_fdw's
handling of concurrent data fetches by multiple ForeignScan nodes
(below *different* Append nodes in the query) using the same
connection discussed in [2]. I modified the patch to just disable
applying this feature to problematic test cases in the postgres_fdw
regression tests, by a new GUC enable_async_append.
Comments welcome! The attached is still WIP and maybe I'm missing
something, though.
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com
[2]: /messages/by-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx+3dyivCT=A@mail.gmail.com
Attachments:
async-wip-2020-11-17.patchapplication/octet-stream; name=async-wip-2020-11-17.patchDownload
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ab3226287d..7093a41445 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -59,6 +59,7 @@ typedef struct ConnCacheEntry
bool invalidated; /* true if reconnect is pending */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ PgFdwConnState state; /* extra per-connection state */
} ConnCacheEntry;
/*
@@ -106,7 +107,7 @@ static bool UserMappingPasswordRequired(UserMapping *user);
* (not even on error), we need this flag to cue manual cleanup.
*/
PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
{
bool found;
bool retry = false;
@@ -256,6 +257,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Remember if caller will prepare statements */
entry->have_prep_stmt |= will_prep_stmt;
+ /* If caller needs access to the per-connection state, return it. */
+ if (state)
+ *state = &entry->state;
+
return entry->conn;
}
@@ -282,6 +287,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ memset(&entry->state, 0, sizeof(entry->state));
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 2d88d06358..274a125b81 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6986,7 +6986,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7014,7 +7014,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7042,7 +7042,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7070,7 +7070,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7098,7 +7098,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+----
(0 rows)
@@ -7140,35 +7140,40 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7178,35 +7183,40 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7238,7 +7248,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-> Hash Join
@@ -7256,7 +7266,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
(39 rows)
@@ -7274,6 +7284,7 @@ select tableoid::regclass, * from bar order by 1,2;
(6 rows)
-- Check UPDATE with inherited target and an appendrel subquery
+SET enable_async_append TO false;
explain (verbose, costs off)
update bar set f2 = f2 + 100
from
@@ -7332,6 +7343,7 @@ update bar set f2 = f2 + 100
from
( select f1 from foo union all select f1+3 from foo ) ss
where bar.f1 = ss.f1;
+RESET enable_async_append;
select tableoid::regclass, * from bar order by 1,2;
tableoid | f1 | f2
----------+----+-----
@@ -8571,9 +8583,9 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
(7 rows)
@@ -8610,19 +8622,19 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
(11 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
@@ -8652,9 +8664,9 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
(7 rows)
@@ -8707,6 +8719,7 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
(14 rows)
-- test FOR UPDATE; partitionwise join does not apply
+SET enable_async_append TO false;
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
QUERY PLAN
@@ -8734,6 +8747,7 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
400 | 400
(4 rows)
+RESET enable_async_append;
RESET enable_partitionwise_join;
-- ===================================================================
-- test partitionwise aggregates
@@ -8758,17 +8772,17 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
(9 rows)
-- Plan with partitionwise aggregates is enabled
@@ -8780,11 +8794,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
(9 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9c5aaacc51..60afa37c76 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -36,6 +37,7 @@
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
#include "postgres_fdw.h"
+#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/float.h"
#include "utils/guc.h"
@@ -154,6 +156,11 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ /* for asynchronous execution */
+ bool async_aware; /* engage async-aware logic? */
+ PgFdwConnState *conn_state; /* extra per-connection state */
+ ForeignScanState *next_node; /* next ForeignScan node to activate */
+
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -391,6 +398,11 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncBegin(AsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(AsyncRequest *areq);
+static void postgresForeignAsyncNotify(AsyncRequest *areq);
+static void postgresForeignAsyncRequest(AsyncRequest *areq);
/*
* Helper functions
@@ -419,6 +431,7 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
+static void fetch_more_data_begin(ForeignScanState *node);
static void fetch_more_data(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
@@ -559,6 +572,13 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for asynchronous execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncBegin = postgresForeignAsyncBegin;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+
PG_RETURN_POINTER(routine);
}
@@ -1434,7 +1454,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
/* Assign a unique ID for my cursor */
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1485,6 +1505,12 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
+
+ /* Initialize async state */
+ fsstate->async_aware = node->ss.ps.plan->async_aware;
+ fsstate->conn_state->curr_node = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
}
/*
@@ -1510,6 +1536,9 @@ postgresIterateForeignScan(ForeignScanState *node)
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
+ /* In async mode, just clear tuple slot. */
+ if (fsstate->async_aware)
+ return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
fetch_more_data(node);
@@ -1539,6 +1568,14 @@ postgresReScanForeignScan(ForeignScanState *node)
char sql[64];
PGresult *res;
+ /* Reset async state */
+ if (fsstate->async_aware)
+ {
+ fsstate->conn_state->curr_node = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
+ }
+
/* If we haven't created the cursor yet, nothing to do. */
if (!fsstate->cursor_exists)
return;
@@ -1597,6 +1634,14 @@ postgresEndForeignScan(ForeignScanState *node)
if (fsstate == NULL)
return;
+ /*
+ * If we're ending before we've collected a response from an asynchronous
+ * query, we have to consume the response.
+ */
+ if (fsstate->conn_state->curr_node == node &&
+ fsstate->conn_state->async_query_sent)
+ fetch_more_data(node);
+
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
close_cursor(fsstate->conn, fsstate->cursor_number);
@@ -2373,7 +2418,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->conn = GetConnection(user, false, NULL);
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2747,7 +2792,7 @@ estimate_path_cost_size(PlannerInfo *root,
false, &retrieved_attrs, NULL);
/* Get the remote estimate */
- conn = GetConnection(fpinfo->user, false);
+ conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3384,6 +3429,34 @@ create_cursor(ForeignScanState *node)
pfree(buf.data);
}
+/*
+ * Begin an asynchronous data fetch.
+ * fetch_more_data must be called to fetch the results.
+ */
+static void
+fetch_more_data_begin(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PGconn *conn = fsstate->conn;
+ char sql[64];
+
+ Assert(fsstate->conn_state->curr_node == node);
+ Assert(!fsstate->conn_state->async_query_sent);
+
+ /* Create the cursor synchronously. */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ /* We will send this query, but not wait for the response. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, fsstate->query);
+
+ fsstate->conn_state->async_query_sent = true;
+}
+
/*
* Fetch some more rows from the node's cursor.
*/
@@ -3406,17 +3479,36 @@ fetch_more_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->conn;
- char sql[64];
int numrows;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
+ if (fsstate->async_aware)
+ {
+ Assert(fsstate->conn_state->curr_node == node);
+ Assert(fsstate->conn_state->async_query_sent);
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
- pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ /*
+ * The query was already sent by an earlier call to
+ * fetch_more_data_begin. So now we just fetch the result.
+ */
+ res = PQgetResult(conn);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
+ else
+ {
+ char sql[64];
+
+ /* This is a regular synchronous fetch. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ res = pgfdw_exec_query(conn, sql);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
/* Convert the data into HeapTuples */
numrows = PQntuples(res);
@@ -3443,6 +3535,15 @@ fetch_more_data(ForeignScanState *node)
/* Must be EOF if we didn't get as many tuples as we asked for. */
fsstate->eof_reached = (numrows < fsstate->fetch_size);
+
+ /* If this was the second part of an async request, we must fetch until NULL. */
+ if (fsstate->async_aware)
+ {
+ /* call once and raise error if not NULL as expected? */
+ while (PQgetResult(conn) != NULL)
+ ;
+ fsstate->conn_state->async_query_sent = false;
+ }
}
PG_FINALLY();
{
@@ -3567,7 +3668,7 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->conn = GetConnection(user, true, NULL);
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -4442,7 +4543,7 @@ postgresAnalyzeForeignTable(Relation relation,
*/
table = GetForeignTable(RelationGetRelid(relation));
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct command to get page count for relation.
@@ -4528,7 +4629,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
table = GetForeignTable(RelationGetRelid(relation));
server = GetForeignServer(table->serverid);
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct cursor that retrieves whole rows from remote.
@@ -4756,7 +4857,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
*/
server = GetForeignServer(serverOid);
mapping = GetUserMapping(GetUserId(), server->serverid);
- conn = GetConnection(mapping, false);
+ conn = GetConnection(mapping, false, NULL);
/* Don't attempt to import collation if remote server hasn't got it */
if (PQserverVersion(conn) < 90100)
@@ -6302,6 +6403,170 @@ add_foreign_final_paths(PlannerInfo *root, RelOptInfo *input_rel,
add_path(final_rel, (Path *) final_path);
}
+/*
+ * postgresIsForeignPathAsyncCapable
+ * Check whether a given ForeignPath node is async-capable.
+ */
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * postgresForeignAsyncBegin
+ * Begin a data fetch from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncBegin(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ ForeignScanState *curr_node = fsstate->conn_state->curr_node;
+
+ /*
+ * If the connection has already been used by another ForeignScan node,
+ * put this ForeignScan node at the end of the waiting-node chain.
+ * Otherwise, activate this ForeignScan node now.
+ */
+ if (curr_node)
+ {
+ PgFdwScanState *curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+
+ /* Scan down the chain ... */
+ while (curr_fsstate->next_node)
+ {
+ curr_node = curr_fsstate->next_node;
+ curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+ }
+ /* Update the chain linking */
+ curr_fsstate->next_node = node;
+ }
+ else
+ {
+ /* Mark the connection as used by the requestee node */
+ fsstate->conn_state->curr_node = node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin a data fetch */
+ fetch_more_data_begin(node);
+ }
+
+ /* Either way mark this ForeignScan node as needing a callback */
+ ExecAsyncMarkAsNeedingCallback(areq);
+}
+
+/*
+ * postgresForeignAsyncConfigureWait
+ * Configure a file descriptor event for which we wish to wait.
+ */
+static void
+postgresForeignAsyncConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AppendState *requestor = (AppendState *) areq->requestor;
+ WaitEventSet *set = requestor->as_eventset;
+
+ Assert(areq->callback_pending);
+
+ /* If the ForeignScan node isn't activated, nothing to do */
+ if (fsstate->conn_state->curr_node != node)
+ return;
+
+ AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
+ NULL, areq);
+}
+
+/*
+ * postgresForeignAsyncNotify
+ * Fetch data we have requested asynchronously.
+ */
+static void
+postgresForeignAsyncNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+
+ /* The core code would have initialized the callback_pending flag */
+ Assert(!areq->callback_pending);
+
+ fetch_more_data(node);
+}
+
+/*
+ * postgresForeignAsyncRequest
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ TupleTableSlot *result;
+
+ /* Get some more tuples, if we've run out */
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ /* No point in another fetch if we already detected EOF, though */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the ForeignScan node as needing a callback */
+ ExecAsyncMarkAsNeedingCallback(areq);
+ return;
+ }
+ fsstate->conn_state->curr_node = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->curr_node = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin a data fetch */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+
+ /* There's nothing more to do; set the result to a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ return;
+ }
+
+ /* Get a tuple from the ForeignScan node */
+ result = ExecProcNode((PlanState *) node);
+
+ if (TupIsNull(result))
+ {
+ Assert(fsstate->next_tuple >= fsstate->num_tuples);
+
+ /* Get some more tuples, if we've not detected EOF yet */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the ForeignScan node as needing a callback */
+ ExecAsyncMarkAsNeedingCallback(areq);
+ return;
+ }
+ fsstate->conn_state->curr_node = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->curr_node = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin a data fetch */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+ }
+
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..ee93262862 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,6 +16,7 @@
#include "foreign/foreign.h"
#include "lib/stringinfo.h"
#include "libpq-fe.h"
+#include "nodes/execnodes.h"
#include "nodes/pathnodes.h"
#include "utils/relcache.h"
@@ -124,12 +125,22 @@ typedef struct PgFdwRelationInfo
int relation_index;
} PgFdwRelationInfo;
+/*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+ ForeignScanState *curr_node; /* currently activated ForeignScan node */
+ bool async_query_sent; /* has an asynchronous query been sent? */
+} PgFdwConnState;
+
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+ PgFdwConnState **state);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 7581c5417b..074cdd96ab 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1799,31 +1799,31 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1859,12 +1859,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1874,6 +1874,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
select tableoid::regclass, * from bar order by 1,2;
-- Check UPDATE with inherited target and an appendrel subquery
+SET enable_async_append TO false;
explain (verbose, costs off)
update bar set f2 = f2 + 100
from
@@ -1883,6 +1884,7 @@ update bar set f2 = f2 + 100
from
( select f1 from foo union all select f1+3 from foo ) ss
where bar.f1 = ss.f1;
+RESET enable_async_append;
select tableoid::regclass, * from bar order by 1,2;
@@ -2492,9 +2494,11 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE a % 25 = 0) t1 FULL JOIN (SELECT 't2_phv' phv, * FROM fprt2 WHERE b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY t1.a, t2.b;
-- test FOR UPDATE; partitionwise join does not apply
+SET enable_async_append TO false;
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
+RESET enable_async_append;
RESET enable_partitionwise_join;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a632cf98ba..24a9e014da 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4704,6 +4704,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<variablelist>
+ <varlistentry id="guc-enable-async-append" xreflabel="enable_async_append">
+ <term><varname>enable_async_append</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_async_append</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of async-aware
+ append plan types. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
<term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 98e1995453..1328df533a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1557,6 +1557,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AppendReady</literal></entry>
+ <entry>Waiting for a subplan of Append to be ready.</entry>
+ </row>
<row>
<entry><literal>BackupWaitWalArchive</literal></entry>
<entry>Waiting for WAL files required for a backup to be successfully
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 43f9b01e83..d4530bda22 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1374,6 +1374,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_aware)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1393,6 +1395,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (custom_name)
ExplainPropertyText("Custom Plan Provider", custom_name, es);
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
+ ExplainPropertyBool("Async Aware", plan->async_aware, es);
}
switch (nodeTag(plan))
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index e2154ba86a..1848d58eda 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -521,6 +521,10 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e69de29bb2..f29d450d27 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
+
+/*
+ * Begin execution of a designated async-aware node.
+ */
+void
+ExecAsyncBegin(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanBegin(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor event
+ * for which it wishes to wait. We expect the node-type specific callback to
+ * make a sigle call of the following form:
+ *
+ * AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, areq);
+ */
+void
+ExecAsyncConfigureWait(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+void
+ExecAsyncNotify(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Asynchronously request a tuple from the asynchronous node.
+ */
+void
+ExecAsyncRequest(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * A requestee node should call this function to indicate that it needs a
+ * callback to deliver tuples to its requestor node. The node can call this
+ * from its ExecAsyncBegin, ExecAsyncNotify, or ExecAsyncRequest callback.
+ */
+void
+ExecAsyncMarkAsNeedingCallback(AsyncRequest *areq)
+{
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ areq->result = NULL;
+}
+
+/*
+ * A requestee node should call this function to deliver the tuple to its
+ * requestor node. The node can call this from its ExecAsyncRequest callback
+ * if the requested tuple is available immediately.
+ */
+void
+ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result)
+{
+ areq->callback_pending = false;
+ areq->request_complete = true;
+ areq->result = result;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..d0969745a4 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -57,10 +57,13 @@
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
/* Shared state for parallel-aware Append. */
struct ParallelAppendState
@@ -78,12 +81,22 @@ struct ParallelAppendState
};
#define INVALID_SUBPLAN_INDEX -1
+#define EVENT_BUFFER_SIZE 16
+
+#define ExecAppendAsyncDone(node) \
+ (bms_is_empty((node)->as_needrequest) && \
+ bms_is_empty((node)->as_asyncpending))
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
static void mark_invalid_subplans_as_finished(AppendState *node);
+static void ExecAppendAsyncBegin(AppendState *node);
+static bool ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncResponse(AsyncRequest *areq);
+static void ExecAppendAsyncEventWait(AppendState *node);
/* ----------------------------------------------------------------
* ExecInitAppend
@@ -102,7 +115,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
AppendState *appendstate = makeNode(AppendState);
PlanState **appendplanstates;
Bitmapset *validsubplans;
+ Bitmapset *asyncplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
@@ -119,6 +134,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_syncdone = false;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -191,12 +207,24 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* While at it, find out the first valid partial plan.
*/
j = 0;
+ asyncplans = NULL;
+ nasyncplans = 0;
firstvalid = nplans;
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record async subplans. When executing EvalPlanQual, we process
+ * async subplans synchronously, so don't do this in that case.
+ */
+ if (initNode->async_aware && estate->es_epq_active == NULL)
+ {
+ asyncplans = bms_add_member(asyncplans, j);
+ nasyncplans++;
+ }
+
/*
* Record the lowest appendplans index which is a valid partial plan.
*/
@@ -210,6 +238,37 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* Initialize async state */
+ appendstate->as_asyncplans = asyncplans;
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_lastasyncplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_needrequest = NULL;
+ appendstate->as_asyncpending = NULL;
+ appendstate->as_asyncrequests = NULL;
+ appendstate->as_eventset = NULL;
+
+ if (nasyncplans > 0)
+ {
+ appendstate->as_asyncrequests = (AsyncRequest **)
+ palloc0(nplans * sizeof(AsyncRequest *));
+
+ i = -1;
+ while ((i = bms_next_member(asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq;
+
+ areq = palloc(sizeof(AsyncRequest));
+ areq->requestor = (PlanState *) appendstate;
+ areq->requestee = appendplanstates[i];
+ areq->request_index = i;
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+
+ appendstate->as_asyncrequests[i] = areq;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -232,31 +291,45 @@ static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
+ TupleTableSlot *result;
- if (node->as_whichplan < 0)
+ if (!node->as_syncdone && node->as_whichplan == INVALID_SUBPLAN_INDEX)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ /* If there are any async subplans, begin execution of them */
+ if (node->as_nasyncplans > 0)
+ ExecAppendAsyncBegin(node);
+
/*
- * If no subplan has been chosen, we must choose one before
+ * If no sync subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
- !node->choose_next_subplan(node))
+ if (!node->choose_next_subplan(node) && ExecAppendAsyncDone(node))
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
{
PlanState *subnode;
- TupleTableSlot *result;
CHECK_FOR_INTERRUPTS();
/*
- * figure out which subplan we are currently processing
+ * try to get a tuple from async subplans
+ */
+ if (!bms_is_empty(node->as_needrequest) ||
+ (node->as_syncdone && !bms_is_empty(node->as_asyncpending)))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
+ Assert(bms_is_empty(node->as_needrequest));
+ }
+
+ /*
+ * figure out which sync subplan we are currently processing
*/
Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
subnode = node->appendplans[node->as_whichplan];
@@ -276,8 +349,16 @@ ExecAppend(PlanState *pstate)
return result;
}
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /* wait or poll async events */
+ if (!bms_is_empty(node->as_asyncpending))
+ {
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ ExecAppendAsyncEventWait(node);
+ }
+
+ /* choose new sync subplan; if no sync/async subplans, we're done */
+ if (!node->choose_next_subplan(node) && ExecAppendAsyncDone(node))
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -313,6 +394,7 @@ ExecEndAppend(AppendState *node)
void
ExecReScanAppend(AppendState *node)
{
+ int nasyncplans = node->as_nasyncplans;
int i;
/*
@@ -347,8 +429,29 @@ ExecReScanAppend(AppendState *node)
ExecReScan(subnode);
}
+ /* Reset async state */
+ node->as_lastasyncplan = INVALID_SUBPLAN_INDEX;
+ bms_free(node->as_needrequest);
+ node->as_needrequest = NULL;
+ bms_free(node->as_asyncpending);
+ node->as_asyncpending = NULL;
+
+ if (nasyncplans > 0)
+ {
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+ }
+ }
+
/* Let choose_next_subplan_* function handle setting the first subplan */
node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_syncdone = false;
}
/* ----------------------------------------------------------------
@@ -429,7 +532,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
/* ----------------------------------------------------------------
* choose_next_subplan_locally
*
- * Choose next subplan for a non-parallel-aware Append,
+ * Choose next sync subplan for a non-parallel-aware Append,
* returning false if there are no more.
* ----------------------------------------------------------------
*/
@@ -444,9 +547,9 @@ choose_next_subplan_locally(AppendState *node)
/*
* If first call then have the bms member function choose the first valid
- * subplan by initializing whichplan to -1. If there happen to be no
- * valid subplans then the bms member function will handle that by
- * returning a negative number which will allow us to exit returning a
+ * sync subplan by initializing whichplan to -1. If there happen to be
+ * no valid sync subplans then the bms member function will handle that
+ * by returning a negative number which will allow us to exit returning a
* false value.
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
@@ -467,7 +570,10 @@ choose_next_subplan_locally(AppendState *node)
nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
if (nextplan < 0)
+ {
+ node->as_syncdone = true;
return false;
+ }
node->as_whichplan = nextplan;
@@ -709,3 +815,292 @@ mark_invalid_subplans_as_finished(AppendState *node)
node->as_pstate->pa_finished[i] = true;
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncBegin
+ *
+ * Begin execution of designated async-aware nodes.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncBegin(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+ int i;
+
+ /* We should never be called when there are no async subplans. */
+ Assert(node->as_nasyncplans > 0);
+
+ if (node->as_valid_subplans == NULL)
+ {
+ node->as_valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+ }
+
+ /* Nothing to do if there are no valid async subplans */
+ if (!bms_overlap(node->as_valid_subplans, node->as_asyncplans))
+ return;
+
+ /* Get valid async subplans. */
+ valid_asyncplans = bms_copy(node->as_asyncplans);
+ valid_asyncplans = bms_int_members(valid_asyncplans,
+ node->as_valid_subplans);
+
+ /* Adjust the node's as_valid_subplans to only contain sync subplans. */
+ node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
+ valid_asyncplans);
+
+ /* Allow async-aware nodes to perform additional initialization. */
+ i = -1;
+ while ((i = bms_next_member(valid_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(areq->request_index == i);
+ Assert(!areq->callback_pending);
+
+ /* Perform the actual callback. */
+ ExecAsyncBegin(areq);
+
+ /*
+ * If the callback_pending flag is still false, the node is ready
+ * for a request. Otherwise, it needs a callback.
+ */
+ if (!areq->callback_pending)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ else
+ node->as_asyncpending = bms_add_member(node->as_asyncpending, i);
+ }
+ bms_free(valid_asyncplans);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncGetNext
+ *
+ * Retrieve a tuple from asynchronous subplans.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
+{
+ *result = NULL;
+
+ /* Request a tuple asynchronously. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ while (!bms_is_empty(node->as_asyncpending))
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Wait or poll async events. */
+ ExecAppendAsyncEventWait(node);
+
+ /* Request a tuple asynchronously. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ /* Break from loop if there is any sync subplan not complete */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If all sync subplans are complete, we're totally done scanning the
+ * given node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the sync subplans.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(ExecAppendAsyncDone(node));
+ *result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncRequest
+ *
+ * Retrieve a tuple from ready subplans if any.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
+{
+ /* Nothing to do if there are no ready subplans. */
+ if (bms_is_empty(node->as_needrequest))
+ return false;
+
+ /* Asynchronously request a tuple from last ready subplan if any. */
+ if (node->as_lastasyncplan != INVALID_SUBPLAN_INDEX)
+ {
+ int i = node->as_lastasyncplan;
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(bms_is_member(i, node->as_needrequest));
+
+ /* Perform the actual callback. */
+ ExecAsyncRequest(areq);
+ if (ExecAppendAsyncResponse(areq))
+ {
+ Assert(!TupIsNull(areq->result));
+ *result = areq->result;
+ return true;
+ }
+ }
+
+ /* Likewise for the other ready subplans if any. */
+ if (!bms_is_empty(node->as_needrequest))
+ {
+ Bitmapset *needrequest = bms_copy(node->as_needrequest);
+ int i = -1;
+
+ while ((i = bms_next_member(needrequest, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ /* Perform the actual callback. */
+ ExecAsyncRequest(areq);
+ if (ExecAppendAsyncResponse(areq))
+ {
+ Assert(!TupIsNull(areq->result));
+ *result = areq->result;
+ bms_free(needrequest);
+ return true;
+ }
+ }
+
+ Assert(bms_is_empty(node->as_needrequest));
+ bms_free(needrequest);
+ return false;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncResponse
+ *
+ * Process a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncResponse(AsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot = areq->result;
+
+ /* The result should be a TupleTableSlot or NULL. */
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ if (!areq->request_complete)
+ {
+ /* The result should be NULL. */
+ Assert(slot == NULL);
+ /* The requestee node would need a callback. */
+ Assert(areq->callback_pending);
+ bms_del_member(node->as_needrequest, areq->request_index);
+ node->as_asyncpending = bms_add_member(node->as_asyncpending,
+ areq->request_index);
+ return false;
+ }
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ {
+ bms_del_member(node->as_needrequest, areq->request_index);
+ node->as_lastasyncplan = INVALID_SUBPLAN_INDEX;
+ return false;
+ }
+
+ /*
+ * Remember the subplan so that ExecAppendAsyncRequest will keep trying
+ * the subplan first until it stops delivering tuples to us.
+ */
+ node->as_lastasyncplan = areq->request_index;
+ return true;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncEventWait
+ *
+ * Wait or poll for file descriptor wait events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncEventWait(AppendState *node)
+{
+ long timeout = node->as_syncdone ? -1 : 0;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+
+ /* Nothing to do if there are no pending subplans. */
+ if (bms_is_empty(node->as_asyncpending))
+ return;
+
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
+ node->as_nasyncplans + 1);
+ AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+ NULL, NULL);
+
+ /* Give each waiting node a chance to add a wait event. */
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncpending, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ ExecAsyncConfigureWait(areq);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+ EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ FreeWaitEventSet(node->as_eventset);
+ node->as_eventset = NULL;
+ if (noccurred == 0)
+ return;
+
+ /* Deliver notifications. */
+ for (i = 0; i < noccurred; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ /*
+ * Each waiting node should have registered its wait event with
+ * user_data pointing back to its AsyncRequest.
+ */
+ if ((w->events & WL_SOCKET_READABLE) != 0)
+ {
+ AsyncRequest *areq = (AsyncRequest *) w->user_data;
+ int request_index = areq->request_index;
+
+ Assert(areq->callback_pending);
+ Assert(bms_is_member(request_index, node->as_asyncpending));
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ areq->callback_pending = false;
+ bms_del_member(node->as_asyncpending, request_index);
+
+ /* Perform the actual callback. */
+ ExecAsyncNotify(areq);
+
+ /*
+ * If the callback_pending flag is kept false, the node would be
+ * ready for a request. Otherwise, it would need a callback.
+ */
+ if (!areq->callback_pending)
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ request_index);
+ else
+ node->as_asyncpending = bms_add_member(node->as_asyncpending,
+ request_index);
+ }
+ }
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0b20f94035..4caecdb78a 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -391,3 +391,67 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanBegin
+ *
+ * Begin execution of a designated async-aware node
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanBegin(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncBegin != NULL);
+ fdwroutine->ForeignAsyncBegin(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Callback invoked when a relevant event has occurred
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Request a tuple asynchronously
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 5a591d0a75..ebc1f013a8 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -120,6 +120,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_aware);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -241,6 +242,7 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 4504b1503b..e9ad2a0803 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -333,6 +333,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_aware);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -431,6 +432,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index ab7b535caa..5bb77d00aa 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1571,6 +1571,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_aware);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1667,6 +1668,7 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index f1dfdc1a4a..4eadca0f50 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -147,6 +147,7 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
+bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 40abe6f9f6..bc43d6f14d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,6 +81,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
+static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1066,6 +1067,30 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
+
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1083,6 +1108,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1220,6 +1246,17 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
+
+ if (enable_async_append)
+ {
+ /* Determine whether the subplan can be executed asynchronously */
+ if (pathkeys == NIL && !best_path->path.parallel_safe &&
+ is_async_capable_path(subpath))
+ {
+ subplan->async_aware = true;
+ ++nasyncplans;
+ }
+ }
}
/*
@@ -1254,6 +1291,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
+ plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e76e627c6b..57d6d933ed 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3898,6 +3898,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
switch (w)
{
+ case WAIT_EVENT_APPEND_READY:
+ event_name = "AppendReady";
+ break;
case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
event_name = "BackupWaitWalArchive";
break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index bb34630e8e..0347eedd33 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1108,6 +1108,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of async append plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_async_append,
+ true,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..c9de4a1b63 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -371,6 +371,7 @@
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on
+#enable_async_append = on
# - Planner Cost Constants -
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
index e69de29bb2..0831726798 100644
--- a/src/include/executor/execAsync.h
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncBegin(AsyncRequest *areq);
+extern void ExecAsyncConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncNotify(AsyncRequest *areq);
+extern void ExecAsyncRequest(AsyncRequest *areq);
+extern void ExecAsyncMarkAsNeedingCallback(AsyncRequest *areq);
+extern void ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..e935a428e3 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,4 +31,9 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanBegin(AsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncForeignScanNotify(AsyncRequest *areq);
+extern void ExecAsyncForeignScanRequest(AsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..558b9ce30e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -170,6 +170,16 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+
+typedef void (*ForeignAsyncBegin_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncConfigureWait_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncNotify_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -246,6 +256,13 @@ typedef struct FdwRoutine
/* Support functions for path reparameterization. */
ReparameterizeForeignPathByChild_function ReparameterizeForeignPathByChild;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncBegin_function ForeignAsyncBegin;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6c0a7d68d6..ac4e459de6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -502,6 +502,22 @@ typedef struct ResultRelInfo
struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
+/* ----------------
+ * AsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct AsyncRequest
+{
+ struct PlanState *requestor; /* Node that wants a tuple */
+ struct PlanState *requestee; /* Node from which a tuple is wanted */
+ int request_index; /* Scratch space for requestor */
+ bool callback_pending; /* Callback is needed */
+ bool request_complete; /* Request complete, result valid */
+ TupleTableSlot *result; /* Result (NULL if no more tuples) */
+} AsyncRequest;
+
/* ----------------
* EState information
*
@@ -1218,6 +1234,15 @@ struct AppendState
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
int as_whichplan;
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_asyncplans; /* asynchronous plans indexes */
+ int as_nasyncplans; /* # of asynchronous plans */
+ int as_lastasyncplan; /* last async plan delivering a tuple */
+ Bitmapset *as_needrequest; /* async plans ready for a request */
+ Bitmapset *as_asyncpending; /* async plans needing a callback */
+ AsyncRequest **as_asyncrequests; /* array of AsyncRequests */
+ struct WaitEventSet *as_eventset; /* WaitEventSet used to configure
+ * file descriptor wait events */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
ParallelAppendState *as_pstate; /* parallel coordination info */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7e6b10f86b..9f6ac35551 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -129,6 +129,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_aware; /* engage async-aware logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -245,6 +250,7 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 6141654e47..107e57bb7c 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -65,6 +65,7 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
+extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 257e515bfe..d4a2d580ca 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -917,6 +917,7 @@ typedef enum
*/
typedef enum
{
+ WAIT_EVENT_APPEND_READY = PG_WAIT_IPC,
- WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE = PG_WAIT_IPC,
+ WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE,
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index dc7ab2ce8b..760847dd2a 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -87,6 +87,7 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Aware": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -136,6 +137,7 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Plan> +
<Node-Type>Seq Scan</Node-Type> +
<Parallel-Aware>false</Parallel-Aware> +
+ <Async-Aware>false</Async-Aware> +
<Relation-Name>int8_tbl</Relation-Name> +
<Alias>i8</Alias> +
<Startup-Cost>N.N</Startup-Cost> +
@@ -183,6 +185,7 @@ select explain_filter('explain (analyze, buffers, format yaml) select * from int
- Plan: +
Node Type: "Seq Scan" +
Parallel Aware: false +
+ Async Aware: false +
Relation Name: "int8_tbl"+
Alias: "i8" +
Startup Cost: N.N +
@@ -233,6 +236,7 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Aware": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -346,6 +350,7 @@ select jsonb_pretty(
"Plan Width": 0, +
"Total Cost": 0.0, +
"Actual Rows": 0, +
+ "Async Aware": false, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
"Relation Name": "tenk1", +
@@ -391,6 +396,7 @@ select jsonb_pretty(
"Plan Width": 0, +
"Total Cost": 0.0, +
"Actual Rows": 0, +
+ "Async Aware": false, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
"Parallel Aware": false, +
@@ -433,6 +439,7 @@ select jsonb_pretty(
"Plan Width": 0, +
"Total Cost": 0.0, +
"Actual Rows": 0, +
+ "Async Aware": false, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
"Parallel Aware": false, +
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 7cf2eee7c1..c97a7d0b89 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -557,6 +557,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
], +
"Node Type": "Incremental Sort", +
"Actual Rows": 55, +
+ "Async Aware": false, +
"Actual Loops": 1, +
"Presorted Key": [ +
"t.a" +
@@ -733,6 +734,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
], +
"Node Type": "Incremental Sort", +
"Actual Rows": 70, +
+ "Async Aware": false, +
"Actual Loops": 1, +
"Presorted Key": [ +
"t.a" +
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index ff157ceb1c..1b5e1d42aa 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -204,6 +204,7 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
+ "Async Aware": false, +
"Relation Name": "insertconflicttest", +
"Alias": "insertconflicttest", +
"Conflict Resolution": "UPDATE", +
@@ -213,7 +214,8 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
{ +
"Node Type": "Result", +
"Parent Relationship": "Member", +
- "Parallel Aware": false +
+ "Parallel Aware": false, +
+ "Async Aware": false +
} +
] +
} +
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 81bdacf59d..b7818c0637 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -88,6 +88,7 @@ select count(*) = 1 as ok from pg_stat_wal;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
@@ -106,7 +107,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
Thank you for the new version.
At Tue, 17 Nov 2020 18:56:02 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Mon, Oct 5, 2020 at 3:35 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Yes, if there are no objections from you or Thomas or Robert or anyone
else, I'll update Robert's patch as such.
Here is a new version of the patch (as promised in the developer
unconference in PostgresConf.CN & PGConf.Asia 2020):
* In Robert's patch [1] (and Horiguchi-san's, which was created based
on Robert's), ExecAppend() was modified to retrieve tuples from
async-aware children *before* the tuples will be needed, but I don't
The "retrieve" means the move of a tuple from fdw to executor
(ExecAppend or ExecAsync) layer?
think that's really a good idea, because the query might complete
before returning the tuples. So I modified that function so that a
I'm not sure how that matters. Anyway, the FDW holds up to tens of
tuples before the executor actually makes requests for them. The
reason for the early fetching is to let the FDW send the next request
as early as possible. (However, I didn't measure the effect of the
nodeAppend-level prefetching.)
tuple is retrieved from an async-aware child *when* it is needed, like
Thomas' patch. I used FDW callback functions proposed by Robert, but
introduced another FDW callback function ForeignAsyncBegin() for each
async-aware child to start an asynchronous data fetch at the first
call to ExecAppend() after ExecInitAppend() or ExecReScanAppend().
Even though the terminology is not officially settled, in past
discussions "async-aware" meant "can handle async-capable subnodes"
and "async-capable" meant "can run asynchronously". Likewise, you
seem to have changed the meaning of as_needrequest from "subnodes that
need to be asked for the next tuple" to "subnodes that have already
been sent a query and are waiting for the result to come". I would
argue for using the words and variable names in those established
meanings. (Yeah, parallel_aware is used in that other sense; I'm not
sure what the better wording for the aware/capable relationship would
be in that case.)
* For EvalPlanQual, I modified the patch so that async-aware children
are treated as if they were synchronous when executing EvalPlanQual.
Doesn't async execution accelerate the EPQ fetching? Or does async
execution run into trouble in the EPQ path?
* In Robert's patch, all async-aware children below Append nodes in
the query waiting for events to occur were managed by a single EState,
but I modified the patch so that such children are managed by each
Append node, like Horiguchi-san's patch and Thomas'.
Managing them in the EState gives an advantage for a push-up style
executor, but managing them in the node state is simpler.
* In Robert's patch, the FDW callback function
ForeignAsyncConfigureWait() allowed multiple events to be configured,
but I limited that function to only allow a single event to be
configured, just for simplicity.
No problem for me.
* I haven't yet added some planner/resowner changes from Horiguchi-san's patch.
* I haven't yet done anything about the issue on postgres_fdw's
handling of concurrent data fetches by multiple ForeignScan nodes
(below *different* Append nodes in the query) using the same
connection discussed in [2]. I modified the patch to just disable
applying this feature to problematic test cases in the postgres_fdw
regression tests, by a new GUC enable_async_append.
Comments welcome! The attached is still WIP and maybe I'm missing
something, though.
Best regards,
Etsuro Fujita
[1] /messages/by-id/CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com
[2] /messages/by-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx+3dyivCT=A@mail.gmail.com
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
I looked through the nodeAppend.c and postgres_fdw.c parts, which I
think are the core of this patch.
- * figure out which subplan we are currently processing
+ * try to get a tuple from async subplans
+ */
+ if (!bms_is_empty(node->as_needrequest) ||
+ (node->as_syncdone && !bms_is_empty(node->as_asyncpending)))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
The function ExecAppendAsyncGetNext() is called only here and
contains only 31 lines. It doesn't seem to me that the separation
makes the code more readable.
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /* wait or poll async events */
+ if (!bms_is_empty(node->as_asyncpending))
+ {
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ ExecAppendAsyncEventWait(node);
You moved the function that waits for events from execAsync to
nodeAppend. The former is a generic module that can be used from any
kind of executor node, but the latter is specialized for Append. In
other words, the abstraction level is lowered here. What is the
reason for the change?
+ /* Perform the actual callback. */
+ ExecAsyncRequest(areq);
+ if (ExecAppendAsyncResponse(areq))
+ {
+ Assert(!TupIsNull(areq->result));
+ *result = areq->result;
Putting aside the names of the functions, the first two functions are
used only this way, in only two places. ExecAsyncRequest(areq) tells
the FDW to store the first tuple among the already-received ones into
areq, and ExecAppendAsyncResponse(areq) checks that the result is
actually set. Finally, the result is retrieved directly from
areq->result. What is the reason that the two functions exist
separately?
+ /* Perform the actual callback. */
+ ExecAsyncNotify(areq);
Mmm. The usage of the function (or its name) looks completely
reversed to me. I think the FDW should NOTIFY the executor nodes that
a new tuple has become available, but the reverse is nonsense. What
the function actually does is REQUEST the FDW to fetch tuples that are
expected to have arrived, which is different from what the name
suggests.
postgres_fdw.c
postgresIterateForeignScan(ForeignScanState *node)
{
PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;

/*
* If this is the first call after Begin or ReScan, we need to create the
* cursor on the remote side.
*/
if (!fsstate->cursor_exists)
create_cursor(node);
With the patch, cursors are also created in another place, so at least
the comment is wrong. That being said, I think we should unify the
code except for the differences between async and sync. For example,
if fetch_more_data_begin() needs to be called only for async fetching,
the cursor should be created before calling that function, in the code
path shared with sync fetching.
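In pseudocode, the shape I have in mind is something like the
following (the function names come from the patch, but this control
flow is only a sketch of the suggestion, not existing code):

```
/* shared path: the cursor is created regardless of sync/async */
if (!fsstate->cursor_exists)
	create_cursor(node);

if (fsstate->async_aware)
	fetch_more_data_begin(node);	/* async: only kick off the FETCH */
else
	fetch_more_data(node);			/* sync: fetch and wait as before */
```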
+
+ /* If this was the second part of an async request, we must fetch until NULL. */
+ if (fsstate->async_aware)
+ {
+ /* call once and raise error if not NULL as expected? */
+ while (PQgetResult(conn) != NULL)
+ ;
+ fsstate->conn_state->async_query_sent = false;
+ }
PQgetResult() receives the result of one query at a time. This code
means several queries (FETCHes) have been queued up, and we discard
every result except the last one. Actually, the res is PQclear'd just
after, so this discards *all* the results of possibly more than one
FETCH. I think something is wrong if we need this.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 20 Nov 2020 20:16:42 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
me> + /* If this was the second part of an async request, we must fetch until NULL. */
me> + if (fsstate->async_aware)
me> + {
me> + /* call once and raise error if not NULL as expected? */
me> + while (PQgetResult(conn) != NULL)
me> + ;
me> + fsstate->conn_state->async_query_sent = false;
me> + }
me>
me> PQgetResult() receives the result of a query at once. This code means
me> several queries (FETCHes) are queued in, and we discard the result
me> except the last one. Actually the res is just PQclear'd just after so
me> this just discards *all* result of maybe more than one FETCHes. I
me> think something's wrong if we need this.
I was wrong, it is worse. That leaks the returned PGresult.
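The fix the review points at is the standard libpq discipline: every non-NULL PGresult returned by PQgetResult() must be PQclear'd. Below is a toy model of that drain loop; PGresultStub, stub_PQgetResult(), and stub_PQclear() are invented for illustration (a real drain would call PQgetResult()/PQclear() on the live connection).

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for PGresult; a fixed queue models pending FETCH results. */
typedef struct PGresultStub
{
    int cleared;
} PGresultStub;

static PGresultStub results[3];
static int next_result = 0;
static int outstanding = 0;     /* results handed out but not yet cleared */

static PGresultStub *stub_PQgetResult(void)
{
    if (next_result >= 3)
        return NULL;            /* NULL marks the end of the results */
    outstanding++;
    return &results[next_result++];
}

static void stub_PQclear(PGresultStub *res)
{
    res->cleared = 1;
    outstanding--;
}

/* Drain every pending result, clearing each one so nothing leaks. */
static void drain_connection(void)
{
    PGresultStub *res;

    while ((res = stub_PQgetResult()) != NULL)
        stub_PQclear(res);
}
```

The original loop body was empty, so each returned result was neither cleared nor used; clearing inside the loop is what prevents the leak.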
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
I tested the patch and encountered several issues, described below:
Issue one:
I get an Assert failure at 'Assert(bms_is_member(i, node->as_needrequest));' in
ExecAppendAsyncRequest() when I use two or more foreign tables
on different foreign servers.
I researched the code and made the change below, after which the Assert failure disappears.
@@ -1004,6 +1004,7 @@ ExecAppendAsyncResponse(AsyncRequest *areq)
 	bms_del_member(node->as_needrequest, areq->request_index);
 	node->as_asyncpending = bms_add_member(node->as_asyncpending, areq->request_index);
+	node->as_lastasyncplan = INVALID_SUBPLAN_INDEX;
 	return false;
 }
Issue two:
Then I tested and found that if I have both sync subplans and async subplans,
it runs through the sync subplans first, and only then do the async ones take
their turn; I do not know whether that is intended.
Issue three:
After the code change mentioned in Issue one, I cannot observe any performance improvement.
When I query the partitioned table and all its sub-partitions, the time spent on the
partitioned table is always the same as the sum over all the sub-partitions.
Sorry if I did something wrong while testing the patch.
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
On Thu, Nov 26, 2020 at 9:28 AM movead.li@highgo.ca
<movead.li@highgo.ca> wrote:
I test the patch and occur several issues as blow:
Issue one:
Get a Assert error at 'Assert(bms_is_member(i, node->as_needrequest));' in
ExecAppendAsyncRequest() function when I use more than two foreign table
on different foreign server.
I research the code and do such change then the Assert problem disappear.
@@ -1004,6 +1004,7 @@ ExecAppendAsyncResponse(AsyncRequest *areq)
 	bms_del_member(node->as_needrequest, areq->request_index);
 	node->as_asyncpending = bms_add_member(node->as_asyncpending, areq->request_index);
+	node->as_lastasyncplan = INVALID_SUBPLAN_INDEX;
 	return false;
 }
Issue two:
Then I test and find if I have sync subplan and async sunbplan, it will run over
the sync subplan then the async turn, I do not know if it is intent.
I only just noticed this patch. It's very interesting to me given the
ongoing work happening on postgres_fdw batching and the way libpq
pipelining is looking like it's getting there. I'll study up on the
executor and see if I can understand this well enough to hack together
a PoC to make it use libpq batching.
Have you taken a look at how this patch may overlap with those?
See -hackers threads:
* "POC: postgres_fdw insert batching" [1]
* "PATCH: Batch/pipelining support for libpq" [2]
[1]: /messages/by-id/OSBPR01MB2982039EA967F0304CC6A3ECFE0B0@OSBPR01MB2982.jpnprd01.prod.outlook.com
[2]: /messages/by-id/20201026190936.GA18705@alvherre.pgsql
On 11/17/20 2:56 PM, Etsuro Fujita wrote:
On Mon, Oct 5, 2020 at 3:35 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Comments welcome! The attached is still WIP and maybe I'm missing
something, though.
I reviewed your patch and used it in my TPC-H benchmarks. It is still
WIP. Will you improve this patch?
I also want to say that, in my opinion, Horiguchi-san's version seems
preferable: it is more structured, simpler to understand, executor-native,
and reduces the FDW interface changes needed. This code really needs only
one procedure: IsForeignPathAsyncCapable.
--
regards,
Andrey Lepikhov
Postgres Professional
On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Tue, 17 Nov 2020 18:56:02 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
* In Robert's patch [1] (and Horiguchi-san's, which was created based
on Robert's), ExecAppend() was modified to retrieve tuples from
async-aware children *before* the tuples will be needed, but I don't

The "retrieve" means the move of a tuple from fdw to executor
(ExecAppend or ExecAsync) layer?
Yes, that's what I mean.
think that's really a good idea, because the query might complete
before returning the tuples. So I modified that function so that a

I'm not sure how it matters. Anyway the fdw holds up to tens of tuples
before the executor actually makes requests for them. The reason for
the early fetching is letting the fdw send the next request as early as
possible. (However, I didn't measure the effect of the
nodeAppend-level prefetching.)
I agree that that would lead to an improved efficiency in some cases,
but I still think that that would be useless in some other cases like
SELECT * FROM sharded_table LIMIT 1. Also, I think the situation
would get worse if we support Append on top of joins or aggregates
over ForeignScans, which would be more expensive to perform than these
ForeignScans.
If we do prefetching, I think it would be better that it’s the
responsibility of the FDW to do prefetching, and I think that that
could be done by letting the FDW to start another data fetch,
independently of the core, in the ForeignAsyncNotify callback routine,
which I revived from Robert's original patch. I think that that would
be more efficient, because the FDW would no longer need to wait until
all buffered tuples are returned to the core. In the WIP patch, I
only allowed the callback routine to put the corresponding ForeignScan
node into a state where it’s either ready for a new request or needing
a callback for another data fetch, but I think we could probably relax
the restriction so that the ForeignScan node can be put into another
state where it’s ready for a new request while needing a callback for
the prefetch.
tuple is retrieved from an async-aware child *when* it is needed, like
Thomas' patch. I used FDW callback functions proposed by Robert, but
introduced another FDW callback function ForeignAsyncBegin() for each
async-aware child to start an asynchronous data fetch at the first
call to ExecAppend() after ExecInitAppend() or ExecReScanAppend().

Even though the terminology is not officially determined, in the past
discussions "async-aware" meant "can handle async-capable subnodes"
and "async-capable" is used as "can run asynchronously".
Thanks for the explanation!
Likewise you
seem to have changed the meaning of as_needrequest from "subnodes that
needs to request for the next tuple" to "subnodes that already have
got query-send request and waiting for the result to come".
No. I think I might slightly change the original definition of
as_needrequest, though.
I would
argue to use the words and variables (names) in such meanings.
I think the word "aware" has a broader meaning, so the naming as
proposed would be OK IMO. But actually, I don't have any strong
opinion about that, so I'll change it as explained.
* For EvalPlanQual, I modified the patch so that async-aware children
are treated as if they were synchronous when executing EvalPlanQual.

Doesn't async execution accelerate the epq-fetching? Or does
async-execution goes into trouble in the EPQ path?
The reason why I disabled async execution when executing EPQ is to
avoid sending asynchronous queries to the remote sides, which would be
useless, because scan tuples for an EPQ recheck are obtained in a
dedicated way.
* In Robert's patch, all async-aware children below Append nodes in
the query waiting for events to occur were managed by a single EState,
but I modified the patch so that such children are managed by each
Append node, like Horiguchi-san's patch and Thomas'.

Managing it in the EState gives an advantage for a push-up style executor,
but managing it in the node state is simpler.
What do you mean by "push-up style executor"?
Thanks for the review! Sorry for the delay.
Best regards,
Etsuro Fujita
On Fri, Nov 20, 2020 at 8:16 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
I looked through the nodeAppend.c and postgres_fdw.c part and those
are I think the core of this patch.
Thanks again for the review!
- * figure out which subplan we are currently processing
+ * try to get a tuple from async subplans
+ */
+ if (!bms_is_empty(node->as_needrequest) ||
+     (node->as_syncdone && !bms_is_empty(node->as_asyncpending)))
+ {
+     if (ExecAppendAsyncGetNext(node, &result))
+         return result;

The function ExecAppendAsyncGetNext() is a function called only here,
and contains only 31 lines. It doesn't seem to me that the separation
makes the code more readable.
Considering the original ExecAppend() is about 50 lines long, 31 lines
of code would not be small. So I'd vote for separating it into
another function as proposed.
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /* wait or poll async events */
+ if (!bms_is_empty(node->as_asyncpending))
+ {
+     Assert(!node->as_syncdone);
+     Assert(bms_is_empty(node->as_needrequest));
+     ExecAppendAsyncEventWait(node);

You moved the function to wait for events from execAsync to
nodeAppend. The former is a generic module that can be used from any
kind of executor nodes, but the latter is specialized for nodeAppend.
In other words, the abstraction level is lowered here. What is the
reason for the change?
The reason is just because that function is only called from
ExecAppend(). I put some functions only called from nodeAppend.c in
execAsync.c, though.
+ /* Perform the actual callback. */
+ ExecAsyncRequest(areq);
+ if (ExecAppendAsyncResponse(areq))
+ {
+     Assert(!TupIsNull(areq->result));
+     *result = areq->result;

Putting aside the names of the functions, the first two functions are
used only this way, at only two places. ExecAsyncRequest(areq) tells the
fdw to store the first tuple among the already-received ones into areq,
and ExecAppendAsyncResponse(areq) checks that the result is actually
set. Finally the result is retrieved directly from areq->result.
What is the reason that the two functions exist separately?
I think that when an async-aware node gets a tuple from an
async-capable node, they should use ExecAsyncRequest() /
ExecAsyncHogeResponse() rather than ExecProcNode() [1]. I modified the
patch so that ExecAppendAsyncResponse() is called from Append, but to
support bubbling up the plan tree discussed in [2], I think it should
be called from ForeignScans (the sides of async-capable nodes). Am I
right? Anyway, I'll rename ExecAppendAsyncResponse() to the name
proposed in Robert's original patch.
+ /* Perform the actual callback. */
+ ExecAsyncNotify(areq);

Mmm. The usage of the function (or its name) looks completely reverse
to me. I think FDW should NOTIFY to exec nodes that the new tuple
gets available but the reverse is nonsense. What the function is
actually doing is to REQUEST fdw to fetch tuples that are expected to
have arrived, which is different from what the name suggests.
As mentioned in a previous email, this is an FDW callback routine
revived from Robert’s patch. I think the naming is reasonable,
because the callback routine notifies the FDW of readiness of a file
descriptor. And actually, the callback routine tells the core whether
the corresponding ForeignScan node is ready for a new request or not,
by setting the callback_pending flag accordingly.
postgres_fdw.c
postgresIterateForeignScan(ForeignScanState *node)
{
PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;

/*
 * If this is the first call after Begin or ReScan, we need to create the
 * cursor on the remote side.
 */
if (!fsstate->cursor_exists)
    create_cursor(node);

With the patch, cursors are also created in another place so at least
the comment is wrong.
Good catch! Will fix.
That being said, I think we should unify the
code except the differences between async and sync. For example, if
the fetch_more_data_begin() needs to be called only for async
fetching, the cursor should be created before calling the function, in
the code path common with sync fetching.
I think that that would make the code easier to understand, but I’m
not 100% sure we really need to do so.
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CA+TgmoYrbgTBnLwnr1v=pk+C=znWg7AgV9=M9ehrq6TDexPQNw@mail.gmail.com
[2]: /messages/by-id/CA+TgmoZSWnhy=SB3ggQcB6EqKxzbNeNn=EfwARnCS5tyhhBNcw@mail.gmail.com
At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 17 Nov 2020 18:56:02 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
* In Robert's patch [1] (and Horiguchi-san's, which was created based
on Robert's), ExecAppend() was modified to retrieve tuples from
async-aware children *before* the tuples will be needed, but I don't

The "retrieve" means the move of a tuple from fdw to executor
(ExecAppend or ExecAsync) layer?

Yes, that's what I mean.
think that's really a good idea, because the query might complete
before returning the tuples. So I modified that function so that a

I'm not sure how it matters. Anyway the fdw holds up to tens of tuples
before the executor actually make requests for them. The reason for
the early fetching is letting fdw send the next request as early as
possible. (However, I didn't measure the effect of the
nodeAppend-level prefetching.)

I agree that that would lead to an improved efficiency in some cases,
but I still think that that would be useless in some other cases like
SELECT * FROM sharded_table LIMIT 1. Also, I think the situation
would get worse if we support Append on top of joins or aggregates
over ForeignScans, which would be more expensive to perform than these
ForeignScans.
I'm not sure which gain we should weigh, but if doing "LIMIT 1" on Append
many times is more common than fetching all rows or "LIMIT <many
multiples of fetch_size>", that discussion would be convincing... Is
it really the case?
Since the core knows of async execution, I think that if we disable async
execution, that should be decided by the planner, which knows how many
tuples are expected to be returned. On the other hand, the most apparent
criterion for whether to enable async or not would be fetch_size, which
is the fdw's secret. Thus we could rename ForeignPathAsyncCapable() to
something like ForeignPathRunAsync(), where a return of true means "the
FDW is telling that it can run async and is thinking that the given
number of tuples will be fetched at once".
If we do prefetching, I think it would be better that it’s the
responsibility of the FDW to do prefetching, and I think that that
could be done by letting the FDW to start another data fetch,
independently of the core, in the ForeignAsyncNotify callback routine,
The FDW does prefetching (if that means sending a request to the remote)
in my patch, so I agree with that. I suspect that you intended to say
the opposite: the core (ExecAppendAsyncGetNext()) controls
prefetching in your patch.
which I revived from Robert's original patch. I think that that would
be more efficient, because the FDW would no longer need to wait until
all buffered tuples are returned to the core. In the WIP patch, I
I don't understand. My patch sends a prefetch-query as soon as all the
tuples of the last remote request are stored into FDW storage. The
reason for removing ExecAsyncNotify() was that it is just redundant as
far as Append asynchrony is concerned. But I particularly oppose reviving
the function.
only allowed the callback routine to put the corresponding ForeignScan
node into a state where it’s either ready for a new request or needing
a callback for another data fetch, but I think we could probably relax
the restriction so that the ForeignScan node can be put into another
state where it’s ready for a new request while needing a callback for
the prefetch.
I don't understand this, either. ExecAsyncNotify() doesn't touch any of
the bitmaps (as_needrequest, callback_pending, or as_asyncpending) in
your patch. Am I looking at something wrong? I'm looking at
async-wip-2020-11-17.patch.
(By the way, one of the things that makes the code hard for me to read is
that "callback" here means "calling an API function". I think none of
them (ExecAsyncBegin, ExecAsyncRequest, ExecAsyncNotify) is a
"callback".)
tuple is retrieved from an async-aware child *when* it is needed, like
Thomas' patch. I used FDW callback functions proposed by Robert, but
introduced another FDW callback function ForeignAsyncBegin() for each
async-aware child to start an asynchronous data fetch at the first
call to ExecAppend() after ExecInitAppend() or ExecReScanAppend().

Even though the terminology is not officially determined, in the past
discussions "async-aware" meant "can handle async-capable subnodes"
and "async-capable" is used as "can run asynchronously".

Thanks for the explanation!
Likewise you
seem to have changed the meaning of as_needrequest from "subnodes that
needs to request for the next tuple" to "subnodes that already have
got query-send request and waiting for the result to come".

No. I think I might slightly change the original definition of
as_needrequest, though.
Mmm, sorry. I may have been perplexed by the comment below, which is
also added to ExecAsyncNotify().
ExecAppendAsyncRequest:
Assert(bms_is_member(i, node->as_needrequest));
/* Perform the actual callback. */
ExecAsyncRequest(areq);
if (ExecAppendAsyncResponse(areq))
{
Assert(!TupIsNull(areq->result));
*result = areq->result;
return true;
}
I would
argue to use the words and variables (names) in such meanings.

I think the word "aware" has a broader meaning, so the naming as
proposed would be OK IMO. But actually, I don't have any strong
opinion about that, so I'll change it as explained.
Thanks.
* For EvalPlanQual, I modified the patch so that async-aware children
are treated as if they were synchronous when executing EvalPlanQual.

Doesn't async execution accelerate the epq-fetching? Or does
async-execution goes into trouble in the EPQ path?

The reason why I disabled async execution when executing EPQ is to
avoid sending asynchronous queries to the remote sides, which would be
useless, because scan tuples for an EPQ recheck are obtained in a
dedicated way.
If EPQ is performed onto Append, I think it should gain from
asynchronous execution since it is going to fetch *a* tuple from
several partitions or children. I believe EPQ doesn't contain Append
in major cases, though. (Or I didn't come up with the steps for the
case to happen...)
* In Robert's patch, all async-aware children below Append nodes in
the query waiting for events to occur were managed by a single EState,
but I modified the patch so that such children are managed by each
Append node, like Horiguchi-san's patch and Thomas'.

Managing in Estate give advantage for push-up style executor but
managing in node_state is simpler.

What do you mean by "push-up style executor"?
The reverse of the volcano-style executor, which enters from the
topmost node and descends to the bottom. In the "push-up style executor",
the bottom-most nodes fire on a certain trigger, then every
intermediate node throws the result up to its parent until it reaches
the topmost node.
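As a toy illustration of this "push-up" flow (the names PushNode, push_up, and the transform functions are invented for this sketch; this is not executor code): a leaf node fires on a trigger and each node pushes its transformed result up to its parent until the topmost node holds the final value.

```c
#include <assert.h>
#include <stddef.h>

/* Toy "push-up" pipeline node; names are illustrative, not executor code. */
typedef struct PushNode PushNode;
struct PushNode
{
    PushNode *parent;               /* NULL at the topmost node */
    int (*transform)(int value);    /* per-node processing */
    int result;                     /* last value seen at this node */
};

/* The bottom-most node fires and the result bubbles up to the top. */
static void push_up(PushNode *node, int value)
{
    node->result = node->transform(value);
    if (node->parent != NULL)
        push_up(node->parent, node->result);
}

static int identity(int v) { return v; }
static int add_one(int v)  { return v + 1; }
```

Contrast with the volcano style, where the top node would instead pull from its child with something like ExecProcNode().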
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Sat, 12 Dec 2020 19:06:51 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Fri, Nov 20, 2020 at 8:16 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

I looked through the nodeAppend.c and postgres_fdw.c part and those
are I think the core of this patch.

Thanks again for the review!

- * figure out which subplan we are currently processing
+ * try to get a tuple from async subplans
+ */
+ if (!bms_is_empty(node->as_needrequest) ||
+     (node->as_syncdone && !bms_is_empty(node->as_asyncpending)))
+ {
+     if (ExecAppendAsyncGetNext(node, &result))
+         return result;

The function ExecAppendAsyncGetNext() is a function called only here,
and contains only 31 lines. It doesn't seem to me that the separation
makes the code more readable.

Considering the original ExecAppend() is about 50 lines long, 31 lines
of code would not be small. So I'd vote for separating it into
another function as proposed.
Ok, I no longer oppose separating some part from ExecAppend().
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /* wait or poll async events */
+ if (!bms_is_empty(node->as_asyncpending))
+ {
+     Assert(!node->as_syncdone);
+     Assert(bms_is_empty(node->as_needrequest));
+     ExecAppendAsyncEventWait(node);

You moved the function to wait for events from execAsync to
nodeAppend. The former is a generic module that can be used from any
kind of executor nodes, but the latter is specialized for nodeAppend.
In other words, the abstraction level is lowered here. What is the
reason for the change?

The reason is just because that function is only called from
ExecAppend(). I put some functions only called from nodeAppend.c in
execAsync.c, though.
(I think) You told me that you preferred the genericity of the
original interface, but you're doing the opposite here. If you think we
can move such a generic feature into a part of the Append node, all other
features can be moved the same way. I guess there's a reason you want
only this specific feature, out of all of them, to be Append-specific,
and I want to know it.
+ /* Perform the actual callback. */
+ ExecAsyncRequest(areq);
+ if (ExecAppendAsyncResponse(areq))
+ {
+     Assert(!TupIsNull(areq->result));
+     *result = areq->result;

Putting aside the names of the functions, the first two functions are
used only this way, at only two places. ExecAsyncRequest(areq) tells the
fdw to store the first tuple among the already-received ones into areq,
and ExecAppendAsyncResponse(areq) checks that the result is actually
set. Finally the result is retrieved directly from areq->result.
What is the reason that the two functions exist separately?

I think that when an async-aware node gets a tuple from an
async-capable node, they should use ExecAsyncRequest() /
ExecAsyncHogeResponse() rather than ExecProcNode() [1]. I modified the
patch so that ExecAppendAsyncResponse() is called from Append, but to
support bubbling up the plan tree discussed in [2], I think it should
be called from ForeignScans (the sides of async-capable nodes). Am I
right? Anyway, I'll rename ExecAppendAsyncResponse() to the name
proposed in Robert's original patch.
Even though I understand the concept, to make it work we need to
remember the parent *async* node somewhere. In my faint memory the
very early patch did something like that.
So I think just providing ExecAsyncResponse() doesn't make it
true. But if we make it true, it would be something like
partially-reversed steps from what the current Exec*()s do for some of
the existing nodes, and further code is required for some other nodes
like WindowFunction. Bubbling up works only in very simple cases where
a returned tuple is thrown up to the further parent as-is, or at least
when the node converts a tuple into another shape. If an async-receiver
node wants to process multiple tuples from a child or from multiple
children, it is no longer just bubbling up.
That being said, we could avoid passing (a kind of) side-channel
information when ExecProcNode is called by providing
ExecAsyncResponse(). But I don't think the "side-channel" is a
problem, since it is just another state of the node.
And.. I think the reason I feel uneasy about the patch may be that the
patch uses the interface names in a somewhat different context.
Originally the framework resides in-between executor nodes, not on a
node of either side. ExecAsyncNotify() notifies the requestee about an
event and ExecAsyncResponse() notifies the requestor about a new
tuple. I don't feel any strangeness in that usage. But this patch feels
to me like it uses the same names in a different (and somewhat wrong)
context.
+ /* Perform the actual callback. */
+ ExecAsyncNotify(areq);

Mmm. The usage of the function (or its name) looks completely reverse
to me. I think FDW should NOTIFY to exec nodes that the new tuple
gets available but the reverse is nonsense. What the function is
actually doing is to REQUEST fdw to fetch tuples that are expected to
have arrived, which is different from what the name suggests.

As mentioned in a previous email, this is an FDW callback routine
revived from Robert’s patch. I think the naming is reasonable,
because the callback routine notifies the FDW of readiness of a file
descriptor. And actually, the callback routine tells the core whether
the corresponding ForeignScan node is ready for a new request or not,
by setting the callback_pending flag accordingly.
Hmm. Agreed. The word "callback" is also used there [3]... I
remember, and it seems reasonable that the core calls AsyncNotify() on
the FDW, and the FDW calls ExecForeignScan as a response to it and
notifies back to the core using ExecAsyncRequestDone(). But the patch
here feels a little strange, or uneasy, to me.
[3]: /messages/by-id/20161018.103051.30820907.horiguchi.kyotaro@lab.ntt.co.jp
postgres_fdw.c
postgresIterateForeignScan(ForeignScanState *node)
{
PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;

/*
* If this is the first call after Begin or ReScan, we need to create the
* cursor on the remote side.
*/
if (!fsstate->cursor_exists)
create_cursor(node);

With the patch, cursors are also created in another place so at least
the comment is wrong.

Good catch! Will fix.
That being said, I think we should unify the
code except the differences between async and sync. For example, if
the fetch_more_data_begin() needs to be called only for async
fetching, the cursor should be created before calling the function, in
the code path common with sync fetching.

I think that that would make the code easier to understand, but I’m
not 100% sure we really need to do so.
And I believe that we don't tolerate even the slightest performance
degradation.
Best regards,
Etsuro Fujita

[1] /messages/by-id/CA+TgmoYrbgTBnLwnr1v=pk+C=znWg7AgV9=M9ehrq6TDexPQNw@mail.gmail.com
[2] /messages/by-id/CA+TgmoZSWnhy=SB3ggQcB6EqKxzbNeNn=EfwARnCS5tyhhBNcw@mail.gmail.com
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Dec 14, 2020 at 4:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

The reason for
the early fetching is letting fdw send the next request as early as
possible. (However, I didn't measure the effect of the
nodeAppend-level prefetching.)

I agree that that would lead to an improved efficiency in some cases,
but I still think that that would be useless in some other cases like
SELECT * FROM sharded_table LIMIT 1. Also, I think the situation
would get worse if we support Append on top of joins or aggregates
over ForeignScans, which would be more expensive to perform than these
ForeignScans.

I'm not sure which gain we weigh on, but if doing "LIMIT 1" on Append
for many times is more common than fetching all or "LIMIT <many
multiples of fetch_size>", that discussion would be convincing... Is
it really the case?
I don't have a clear answer for that... Performance in the case you
mentioned would be improved by async execution without prefetching by
Append, so it seemed reasonable to me to remove that prefetching to
avoid unnecessary overheads in the case I mentioned. BUT: I started
to think my proposal, which needs an additional FDW callback routine
(ie, ForeignAsyncBegin()), might be a bad idea, because it would
increase the burden on FDW authors.
If we do prefetching, I think it would be better that it’s the
responsibility of the FDW to do prefetching, and I think that that
could be done by letting the FDW to start another data fetch,
independently of the core, in the ForeignAsyncNotify callback routine,

FDW does prefetching (if it means sending request to remote) in my
patch, so I agree to that. It suspect that you were intended to say
the opposite. The core (ExecAppendAsyncGetNext()) controls
prefetching in your patch.
No. That function just tries to retrieve a tuple from any of the
ready subplans (ie, subplans marked as as_needrequest).
which I revived from Robert's original patch. I think that that would
be more efficient, because the FDW would no longer need to wait until
all buffered tuples are returned to the core. In the WIP patch, I

I don't understand. My patch sends a prefetch-query as soon as all the
tuples of the last remote-request is stored into FDW storage. The
reason for removing ExecAsyncNotify() was it is just redundant as far
as concerning Append asynchrony. But I particulary oppose to revive
the function.
Sorry, my explanation was not good, but what I'm saying here is about
my patch, not your patch. I think this FDW callback routine would be
useful; it allows an FDW to perform another asynchronous data fetch
before delivering a tuple to the core as discussed in [1]. Also, it
would be useful when extending to the case where we have intermediate
nodes between an Append and a ForeignScan such as joins or aggregates,
which I'll explain below.
only allowed the callback routine to put the corresponding ForeignScan
node into a state where it’s either ready for a new request or needing
a callback for another data fetch, but I think we could probably relax
the restriction so that the ForeignScan node can be put into another
state where it’s ready for a new request while needing a callback for
the prefetch.

I don't understand this, too. ExecAsyncNotify() doesn't touch any of
the bitmaps, as_needrequest, callback_pending nor as_asyncpending in
your patch. Am I looking into something wrong? I'm looking
async-wip-2020-11-17.patch.
In the WIP patch I post, these bitmaps are modified in the core side
based on the callback_pending and request_complete flags in
AsyncRequests returned from FDWs (See ExecAppendAsyncEventWait()).
(By the way, it is one of those that make the code hard to read to me
that the "callback" means "calling an API function". I think none of
them (ExecAsyncBegin, ExecAsyncRequest, ExecAsyncNotify) are a
"callback".)
I thought the word “callback” was OK, because these functions would
call the corresponding FDW callback routines, but I’ll revise the
wording.
The reason why I disabled async execution when executing EPQ is to
avoid sending asynchronous queries to the remote sides, which would be
useless, because scan tuples for an EPQ recheck are obtained in a
dedicated way.

If EPQ is performed onto Append, I think it should gain from
asynchronous execution since it is going to fetch *a* tuple from
several partitions or children. I believe EPQ doesn't contain Append
in major cases, though. (Or I didn't come up with the steps for the
case to happen...)
Sorry, I don’t understand this part. Could you elaborate a bit more on it?
What do you mean by "push-up style executor"?
The reverse of the volcano-style executor, which enters from the
topmost node and down to the bottom. In the "push-up stule executor",
the bottom-most nodes fires by a certain trigger then every
intermediate nodes throws up the result to the parent until reaching
the topmost node.
That is what I'm thinking to be able to support the case I mentioned
above. I think that that would allow us to find ready subplans
efficiently from occurred wait events in ExecAppendAsyncEventWait().
Consider a plan like this:
Append
-> Nested Loop
-> Foreign Scan on a
-> Foreign Scan on b
-> ...
I assume here that Foreign Scan on a, Foreign Scan on b, and Nested
Loop are all async-capable and that we have somewhere in the executor
an AsyncRequest with requestor="Nested Loop" and requestee="Foreign
Scan on a", an AsyncRequest with requestor="Nested Loop" and
requestee="Foreign Scan on b", and an AsyncRequest with
requestor="Append" and requestee="Nested Loop". In
ExecAppendAsyncEventWait(), if a file descriptor for foreign table a
becomes ready, we would call ForeignAsyncNotify() for a, and if it
returns a tuple back to the requestor node (ie, Nested Loop) (using
ExecAsyncResponse()), then *ForeignAsyncNotify() would be called for
Nested Loop*. Nested Loop would then call ExecAsyncRequest() for the
inner requestee node (ie, Foreign Scan on b; I assume here that it is
a foreign scan parameterized by a). If Foreign Scan on b returns a
tuple back to the requestor node (ie, Nested Loop) (using
ExecAsyncResponse()), then Nested Loop would match the tuples from the
outer and inner sides. If they match, the join result would be
returned back to the requestor node (ie, Append) (using
ExecAsyncResponse()), marking the Nested Loop subplan as
as_needrequest. Otherwise, Nested Loop would call ExecAsyncRequest()
for the inner requestee node for the next tuple, and so on. If
ExecAsyncRequest() can't return a tuple immediately, we would wait
until a file descriptor for foreign table b becomes ready; we would
start from calling ForeignAsyncNotify() for b when the file descriptor
becomes ready. In this way we could find ready subplans efficiently
from the wait events that occur in ExecAppendAsyncEventWait() when extending
to the case where subplans are joins or aggregates over Foreign Scans,
I think. Maybe I’m missing something, though.
Thanks for the comments!
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CAPmGK153oorYtTpW_-aZrjH-iecHbykX7qbxX_5630ZK8nqVHg@mail.gmail.com
On Mon, Dec 14, 2020 at 5:56 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Sat, 12 Dec 2020 19:06:51 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Fri, Nov 20, 2020 at 8:16 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
+ /* wait or poll async events */
+ if (!bms_is_empty(node->as_asyncpending))
+ {
+     Assert(!node->as_syncdone);
+     Assert(bms_is_empty(node->as_needrequest));
+     ExecAppendAsyncEventWait(node);

You moved the function to wait for events from execAsync to
nodeAppend. The former is a generic module that can be used from any
kind of executor node, but the latter is specialized for nodeAppend.
In other words, the abstraction level is lowered here. What is the
reason for the change?

The reason is just that that function is only called from
ExecAppend(). I put some functions only called from nodeAppend.c in
execAsync.c, though.

(I think) You told me that you preferred the genericity of the
original interface, but you're doing the opposite. If you think we
can move such a generic feature into a part of the Append node, all other
features can be moved the same way. I guess there's a reason you want
only this specific feature, out of all of them, to be Append-specific,
and I want to know it.
The reason is that I’m thinking to add a small feature for
multiplexing Append subplans, not a general feature for async
execution as discussed in [1]/messages/by-id/CA+Tgmobx8su_bYtAa3DgrqB+R7xZG6kHRj0ccMUUshKAQVftww@mail.gmail.com, because this would be an interim
solution until the executor rewrite is done.
I think that when an async-aware node gets a tuple from an
async-capable node, it should use ExecAsyncRequest() /
ExecAsyncHogeResponse() rather than ExecProcNode() [1]. I modified the
patch so that ExecAppendAsyncResponse() is called from Append, but to
support bubbling up the plan tree discussed in [2], I think it should
be called from ForeignScans (the sides of async-capable nodes). Am I
right? Anyway, I'll rename ExecAppendAsyncResponse() to the one
proposed in Robert's original patch.

Even though I understand the concept, to make it work we need to
remember the parent *async* node somewhere. In my faint memory the
very early patch did something like that.

So I think just providing ExecAsyncResponse() doesn't make it
true. But if we make it true, it would be something like
partially-reversed steps from what the current Exec*()s do for some of
the existing nodes, and further code would be required for some other
nodes like WindowFunction. Bubbling up works only in very simple cases
where a returned tuple is thrown up to the further parent as-is, or at
least where the node merely converts a tuple into another shape. If an
async-receiver node wants to process multiple tuples from a child or
from multiple children, it is no longer just bubbling up.
I explained the meaning of “bubbling up the plan tree” in a previous
email I sent a moment ago.
And.. I think the reason I feel uneasy about the patch may be that the
patch uses the interface names in a somewhat different context.
Originally the framework resides in-between executor nodes, not on a
node of either side. ExecAsyncNotify() notifies the requestee about an
event and ExecAsyncResponse() notifies the requestor about a new
tuple. I don't feel strangeness in this usage. But this patch feels to
me like it uses the same names in a different (and somewhat wrong) context.
Sorry, this is a WIP patch. Will fix.
+ /* Perform the actual callback. */
+ ExecAsyncNotify(areq);

Mmm. The usage of the function (or its name) looks completely reverse
to me.
As mentioned in a previous email, this is an FDW callback routine
revived from Robert’s patch. I think the naming is reasonable,
because the callback routine notifies the FDW of readiness of a file
descriptor. And actually, the callback routine tells the core whether
the corresponding ForeignScan node is ready for a new request or not,
by setting the callback_pending flag accordingly.

Hmm. Agreed. The word "callback" is also used there [3]... I
remember, and it seems reasonable, that the core calls AsyncNotify() on the
FDW, and the FDW calls ExecForeignScan as a response to it and notifies
the core of that using ExecAsyncRequestDone(). But the patch here
feels a little strange, or uneasy, to me.
I’m not sure what I should do to improve the patch. Could you
elaborate a bit more on this part?
postgres_fdw.c

postgresIterateForeignScan(ForeignScanState *node)
{
    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
    TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;

    /*
     * If this is the first call after Begin or ReScan, we need to create the
     * cursor on the remote side.
     */
    if (!fsstate->cursor_exists)
        create_cursor(node);
That being said, I think we should unify the
code except the differences between async and sync. For example, if
the fetch_more_data_begin() needs to be called only for async
fetching, the cursor should be created before calling the function, in
the code path common with sync fetching.

I think that that would make the code easier to understand, but I'm
not 100% sure we really need to do so.

And I believe that we don't tolerate even the slightest performance
degradation.
In the case of async execution, the cursor would have already been
created before we get here as mentioned by you, so we would just skip
create_cursor() in that case. I don’t think that that would degrade
performance noticeably. Am I wrong?
Thanks again!
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CA+Tgmobx8su_bYtAa3DgrqB+R7xZG6kHRj0ccMUUshKAQVftww@mail.gmail.com
On Thu, Nov 26, 2020 at 10:28 AM movead.li@highgo.ca
<movead.li@highgo.ca> wrote:
I tested the patch and encountered several issues, as below:
Thank you for the review!
Issue one:
I get an Assert error at 'Assert(bms_is_member(i, node->as_needrequest));' in
the ExecAppendAsyncRequest() function when I use more than two foreign tables
on different foreign servers.

I researched the code and made a change, after which the Assert problem disappeared.
Could you show a test case causing the assertion failure?
Issue two:
Then I tested and found that if I have a sync subplan and an async
subplan, it runs through the sync subplan first and only then takes the
async turn; I do not know if that is intended.
Did you use a partitioned table with only two partitions where one is
local and the other is remote? If so, that would be expected, because
in that case, 1) the patch would first send an asynchronous query to
the remote, 2) it would then process the local partition until the
end, 3) it would then wait/poll the async event, and 4) it would
finally process the remote partition when the event occurs.
Sorry for the delay.
Best regards,
Etsuro Fujita
On Thu, Dec 10, 2020 at 3:38 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
On 11/17/20 2:56 PM, Etsuro Fujita wrote:
On Mon, Oct 5, 2020 at 3:35 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Comments welcome! The attached is still WIP and maybe I'm missing
something, though.

I reviewed your patch and used it in my TPC-H benchmarks. It is still
WIP. Will you improve this patch?
Yeah, will do.
I also want to say that, in my opinion, Horiguchi-san's version seems
preferable: it is more structured, simpler to understand, executor-native,
and allows reducing FDW interface changes.
I’m not sure what you mean by “executor-native”, but I partly agree
that Horiguchi-san’s version would be easier to understand, because
his version was made so that a tuple is requested from an async
subplan using our Volcano Iterator model almost as-is. But my
concerns about his version would be: 1) it’s actually pretty invasive,
because it changes the contract of the ExecProcNode() API [1]/messages/by-id/CAPmGK16YXCADSwsFLSxqTBBLbt3E_=iigKTtjS=dqu+8K8DWCw@mail.gmail.com, and 2)
IIUC it wouldn’t allow us to find ready subplans from occurred wait
events when we extend to the case where subplans are joins or
aggregates over ForeignScans [2]/messages/by-id/CAPmGK16rA5ODyRrVK9iPsyW-td2RcRZXsdWoVhMmLLmUhprsTg@mail.gmail.com.
This code really only needs
one procedure - IsForeignPathAsyncCapable.
This isn’t correct: his version uses ForeignAsyncConfigureWait() as well.
Thank you for reviewing! Sorry for the delay.
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CAPmGK16YXCADSwsFLSxqTBBLbt3E_=iigKTtjS=dqu+8K8DWCw@mail.gmail.com
[2]: /messages/by-id/CAPmGK16rA5ODyRrVK9iPsyW-td2RcRZXsdWoVhMmLLmUhprsTg@mail.gmail.com
On Sat, Dec 19, 2020 at 5:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Mon, Dec 14, 2020 at 4:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

The reason for
the early fetching is letting fdw send the next request as early as
possible. (However, I didn't measure the effect of the
nodeAppend-level prefetching.)

I agree that that would lead to improved efficiency in some cases,
but I still think that that would be useless in some other cases like
SELECT * FROM sharded_table LIMIT 1. Also, I think the situation
would get worse if we support Append on top of joins or aggregates
over ForeignScans, which would be more expensive to perform than these
ForeignScans.

I'm not sure which gain we weigh on, but if doing "LIMIT 1" on Append
for many times is more common than fetching all or "LIMIT <many
multiples of fetch_size>", that discussion would be convincing... Is
it really the case?

I don't have a clear answer for that... Performance in the case you
mentioned would be improved by async execution without prefetching by
Append, so it seemed reasonable to me to remove that prefetching to
avoid unnecessary overheads in the case I mentioned. BUT: I started
to think my proposal, which needs an additional FDW callback routine
(ie, ForeignAsyncBegin()), might be a bad idea, because it would
increase the burden on FDW authors.
I dropped my proposal; I modified the patch so that ExecAppend()
requests tuples from all subplans needing a request *at once*, as
originally proposed by Robert and then you. Please find attached a
new version of the patch.
Other changes:
* I renamed ExecAppendAsyncResponse() to what was originally proposed
by Robert, and modified the patch so that that function is called from
the requestee side, not the requestor side as in the previous version.
* I renamed the variable async_aware as explained by you.
* I tweaked comments a bit to address your comments.
* I made code simpler, and added a bit more assertions.
Best regards,
Etsuro Fujita
Attachments:
async-wip-2020-12-31.patchapplication/octet-stream; name=async-wip-2020-12-31.patchDownload
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index d841cec39b..b5b6d30c39 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -59,6 +59,7 @@ typedef struct ConnCacheEntry
bool invalidated; /* true if reconnect is pending */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ PgFdwConnState state; /* extra per-connection state */
} ConnCacheEntry;
/*
@@ -106,7 +107,7 @@ static bool UserMappingPasswordRequired(UserMapping *user);
* (not even on error), we need this flag to cue manual cleanup.
*/
PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
{
bool found;
bool retry = false;
@@ -253,6 +254,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Remember if caller will prepare statements */
entry->have_prep_stmt |= will_prep_stmt;
+ /* If caller needs access to the per-connection state, return it. */
+ if (state)
+ *state = &entry->state;
+
return entry->conn;
}
@@ -279,6 +284,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ memset(&entry->state, 0, sizeof(entry->state));
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c11092f8cc..ac931e56e9 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6986,7 +6986,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7014,7 +7014,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7042,7 +7042,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7070,7 +7070,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7098,7 +7098,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+----
(0 rows)
@@ -7140,35 +7140,40 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7178,35 +7183,40 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7238,7 +7248,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-> Hash Join
@@ -7256,7 +7266,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
(39 rows)
@@ -7274,6 +7284,7 @@ select tableoid::regclass, * from bar order by 1,2;
(6 rows)
-- Check UPDATE with inherited target and an appendrel subquery
+SET enable_async_append TO false;
explain (verbose, costs off)
update bar set f2 = f2 + 100
from
@@ -7332,6 +7343,7 @@ update bar set f2 = f2 + 100
from
( select f1 from foo union all select f1+3 from foo ) ss
where bar.f1 = ss.f1;
+RESET enable_async_append;
select tableoid::regclass, * from bar order by 1,2;
tableoid | f1 | f2
----------+----+-----
@@ -8571,9 +8583,9 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
(7 rows)
@@ -8610,19 +8622,19 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
(11 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
@@ -8652,9 +8664,9 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
(7 rows)
@@ -8707,6 +8719,7 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
(14 rows)
-- test FOR UPDATE; partitionwise join does not apply
+SET enable_async_append TO false;
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
QUERY PLAN
@@ -8734,6 +8747,7 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
400 | 400
(4 rows)
+RESET enable_async_append;
RESET enable_partitionwise_join;
-- ===================================================================
-- test partitionwise aggregates
@@ -8758,17 +8772,17 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
(9 rows)
-- Plan with partitionwise aggregates is enabled
@@ -8780,11 +8794,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
(9 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index b6c72e1d1e..08166805b6 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -37,6 +38,7 @@
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
#include "postgres_fdw.h"
+#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/float.h"
#include "utils/guc.h"
@@ -155,6 +157,11 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ /* for asynchronous execution */
+ bool async_capable; /* engage asynchronous-capable logic? */
+ PgFdwConnState *conn_state; /* extra per-connection state */
+ ForeignScanState *next_node; /* next ForeignScan node to activate */
+
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -392,6 +399,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(AsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(AsyncRequest *areq);
+static void postgresForeignAsyncNotify(AsyncRequest *areq);
/*
* Helper functions
@@ -420,6 +431,7 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
+static void fetch_more_data_begin(ForeignScanState *node);
static void fetch_more_data(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
@@ -471,6 +483,7 @@ static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
double *totaldeadrows);
static void analyze_row_processor(PGresult *res, int row,
PgFdwAnalyzeState *astate);
+static void request_tuple_asynchronously(AsyncRequest *areq);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -560,6 +573,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for asynchronous execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1435,7 +1454,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
/* Assign a unique ID for my cursor */
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1486,6 +1505,12 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
+
+ /* Initialize async state */
+ fsstate->async_capable = node->ss.ps.plan->async_capable;
+ fsstate->conn_state->activated = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
}
/*
@@ -1511,6 +1536,9 @@ postgresIterateForeignScan(ForeignScanState *node)
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
+ /* In async mode, just clear tuple slot. */
+ if (fsstate->async_capable)
+ return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
fetch_more_data(node);
@@ -1540,6 +1568,14 @@ postgresReScanForeignScan(ForeignScanState *node)
char sql[64];
PGresult *res;
+ /* Reset async state */
+ if (fsstate->async_capable)
+ {
+ fsstate->conn_state->activated = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
+ }
+
/* If we haven't created the cursor yet, nothing to do. */
if (!fsstate->cursor_exists)
return;
@@ -1598,6 +1634,14 @@ postgresEndForeignScan(ForeignScanState *node)
if (fsstate == NULL)
return;
+ /*
+ * If we're ending before we've collected a response from an asynchronous
+ * query, we have to consume the response.
+ */
+ if (fsstate->conn_state->activated == node &&
+ fsstate->conn_state->async_query_sent)
+ fetch_more_data(node);
+
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
close_cursor(fsstate->conn, fsstate->cursor_number);
@@ -2374,7 +2418,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->conn = GetConnection(user, false, NULL);
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2748,7 +2792,7 @@ estimate_path_cost_size(PlannerInfo *root,
false, &retrieved_attrs, NULL);
/* Get the remote estimate */
- conn = GetConnection(fpinfo->user, false);
+ conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3376,6 +3420,34 @@ create_cursor(ForeignScanState *node)
pfree(buf.data);
}
+/*
+ * Begin an asynchronous data fetch.
+ * fetch_more_data must be called to fetch the results.
+ */
+static void
+fetch_more_data_begin(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PGconn *conn = fsstate->conn;
+ char sql[64];
+
+ Assert(fsstate->conn_state->activated == node);
+ Assert(!fsstate->conn_state->async_query_sent);
+
+ /* Create the cursor synchronously. */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ /* We will send this query, but not wait for the response. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (PQsendQuery(conn, sql) < 0)
+ pgfdw_report_error(ERROR, NULL, conn, false, fsstate->query);
+
+ fsstate->conn_state->async_query_sent = true;
+}
+
/*
* Fetch some more rows from the node's cursor.
*/
@@ -3398,17 +3470,36 @@ fetch_more_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->conn;
- char sql[64];
int numrows;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
+ if (fsstate->async_capable)
+ {
+ Assert(fsstate->conn_state->activated == node);
+ Assert(fsstate->conn_state->async_query_sent);
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
- pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ /*
+ * The query was already sent by an earlier call to
+ * fetch_more_data_begin. So now we just fetch the result.
+ */
+ res = PQgetResult(conn);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
+ else
+ {
+ char sql[64];
+
+ /* This is a regular synchronous fetch. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ res = pgfdw_exec_query(conn, sql);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
/* Convert the data into HeapTuples */
numrows = PQntuples(res);
@@ -3435,6 +3526,15 @@ fetch_more_data(ForeignScanState *node)
/* Must be EOF if we didn't get as many tuples as we asked for. */
fsstate->eof_reached = (numrows < fsstate->fetch_size);
+
+ /* If this was the second part of an async request, we must fetch until NULL. */
+ if (fsstate->async_capable)
+ {
+ /* XXX: call just once and raise an error if the result is not NULL? */
+ while (PQgetResult(conn) != NULL)
+ ;
+ fsstate->conn_state->async_query_sent = false;
+ }
}
PG_FINALLY();
{
@@ -3559,7 +3659,7 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->conn = GetConnection(user, true, NULL);
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -4434,7 +4534,7 @@ postgresAnalyzeForeignTable(Relation relation,
*/
table = GetForeignTable(RelationGetRelid(relation));
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct command to get page count for relation.
@@ -4520,7 +4620,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
table = GetForeignTable(RelationGetRelid(relation));
server = GetForeignServer(table->serverid);
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct cursor that retrieves whole rows from remote.
@@ -4748,7 +4848,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
*/
server = GetForeignServer(serverOid);
mapping = GetUserMapping(GetUserId(), server->serverid);
- conn = GetConnection(mapping, false);
+ conn = GetConnection(mapping, false, NULL);
/* Don't attempt to import collation if remote server hasn't got it */
if (PQserverVersion(conn) < 90100)
@@ -6294,6 +6394,175 @@ add_foreign_final_paths(PlannerInfo *root, RelOptInfo *input_rel,
add_path(final_rel, (Path *) final_path);
}
+/*
+ * postgresIsForeignPathAsyncCapable
+ * Check whether a given ForeignPath node is async-capable.
+ */
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * postgresForeignAsyncRequest
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ ForeignScanState *curr_node;
+
+ /* Get currently-activated ForeignScan node. */
+ if (fsstate->conn_state->activated == NULL)
+ fsstate->conn_state->activated = node;
+ curr_node = fsstate->conn_state->activated;
+
+ /*
+ * If the ForeignScan node is not the currently-activated one, put it at
+ * the end of the chain of waiting ForeignScan nodes, and then return.
+ */
+ if (node != curr_node)
+ {
+ PgFdwScanState *curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+
+ /* Scan down the chain ... */
+ while (curr_fsstate->next_node)
+ {
+ curr_node = curr_fsstate->next_node;
+ Assert(node != curr_node);
+ curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+ }
+ /* Update the chain linking */
+ curr_fsstate->next_node = node;
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+
+ request_tuple_asynchronously(areq);
+}
+
+/*
+ * postgresForeignAsyncConfigureWait
+ * Configure a file descriptor event for which we wish to wait.
+ */
+static void
+postgresForeignAsyncConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AppendState *requestor = (AppendState *) areq->requestor;
+ WaitEventSet *set = requestor->as_eventset;
+
+ /* This function should not be called unless callback_pending */
+ Assert(areq->callback_pending);
+
+ /* If the ForeignScan node isn't activated yet, nothing to do */
+ if (fsstate->conn_state->activated != node)
+ return;
+
+ AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
+ NULL, areq);
+}
+
+/*
+ * postgresForeignAsyncNotify
+ * Fetch some more tuples from a file descriptor that has become ready,
+ * then request the next tuple.
+ */
+static void
+postgresForeignAsyncNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+
+ /* The core code would have initialized the callback_pending flag */
+ Assert(!areq->callback_pending);
+
+ fetch_more_data(node);
+
+ request_tuple_asynchronously(areq);
+}
+
+/*
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+request_tuple_asynchronously(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ TupleTableSlot *result;
+
+ /* Request some more tuples, if we've run out */
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ /* No point in another fetch if we already detected EOF, though */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+ fsstate->conn_state->activated = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->activated = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin an asynchronous fetch for that node */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ return;
+ }
+
+ /* Get a tuple from the ForeignScan node */
+ result = ExecProcNode((PlanState *) node);
+
+ if (TupIsNull(result))
+ {
+ Assert(fsstate->next_tuple >= fsstate->num_tuples);
+
+ /* Request some more tuples, if we've not detected EOF yet */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+ fsstate->conn_state->activated = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->activated = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin an asynchronous fetch for that node */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+ }
+
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
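The connection-sharing logic above (PgFdwConnState.activated plus the next_node chain) can be tried out in isolation. This is a minimal standalone sketch, not the patch's code: ConnState, ScanNode, conn_request and conn_release are invented names that mimic how postgresForeignAsyncRequest queues a node behind the currently-activated one and how the connection is handed to the next waiter at EOF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of PgFdwConnState/PgFdwScanState chaining:
 * one connection, several scan nodes; only the "activated" node may
 * have a query in flight, the others queue behind it. */
typedef struct ScanNode ScanNode;

typedef struct ConnState
{
	ScanNode   *activated;		/* node currently using the connection */
} ConnState;

struct ScanNode
{
	ConnState  *conn_state;
	ScanNode   *next_node;		/* chain of waiting nodes */
};

/* Request use of the connection: true if the node may send a query
 * right away, false if it was appended to the waiters chain. */
bool
conn_request(ScanNode *node)
{
	ConnState  *cs = node->conn_state;
	ScanNode   *cur;

	if (cs->activated == NULL)
		cs->activated = node;
	if (cs->activated == node)
		return true;

	/* Scan down the chain and link this node at the end. */
	cur = cs->activated;
	while (cur->next_node)
		cur = cur->next_node;
	cur->next_node = node;
	return false;
}

/* Active node reached EOF: hand the connection to the next waiter,
 * if any, and return it. */
ScanNode *
conn_release(ScanNode *node)
{
	ConnState  *cs = node->conn_state;

	cs->activated = node->next_node;
	node->next_node = NULL;
	return cs->activated;
}
```

The point of the chain is that a single libpq connection can carry only one in-flight query, so concurrent ForeignScan nodes on the same connection must take turns.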
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..15c9750f8b 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,6 +16,7 @@
#include "foreign/foreign.h"
#include "lib/stringinfo.h"
#include "libpq-fe.h"
+#include "nodes/execnodes.h"
#include "nodes/pathnodes.h"
#include "utils/relcache.h"
@@ -124,12 +125,22 @@ typedef struct PgFdwRelationInfo
int relation_index;
} PgFdwRelationInfo;
+/*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+ ForeignScanState *activated; /* currently-activated ForeignScan node */
+ bool async_query_sent; /* has an asynchronous query been sent? */
+} PgFdwConnState;
+
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+ PgFdwConnState **state);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 25dbc08b98..ad7bf90bb0 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1799,31 +1799,31 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1859,12 +1859,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1874,6 +1874,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
select tableoid::regclass, * from bar order by 1,2;
-- Check UPDATE with inherited target and an appendrel subquery
+SET enable_async_append TO false;
explain (verbose, costs off)
update bar set f2 = f2 + 100
from
@@ -1883,6 +1884,7 @@ update bar set f2 = f2 + 100
from
( select f1 from foo union all select f1+3 from foo ) ss
where bar.f1 = ss.f1;
+RESET enable_async_append;
select tableoid::regclass, * from bar order by 1,2;
@@ -2492,9 +2494,11 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE a % 25 = 0) t1 FULL JOIN (SELECT 't2_phv' phv, * FROM fprt2 WHERE b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY t1.a, t2.b;
-- test FOR UPDATE; partitionwise join does not apply
+SET enable_async_append TO false;
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
+RESET enable_async_append;
RESET enable_partitionwise_join;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 67de4150b8..1427ffc82b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4733,6 +4733,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<variablelist>
+ <varlistentry id="guc-enable-async-append" xreflabel="enable_async_append">
+ <term><varname>enable_async_append</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_async_append</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of async-aware
+ append plan types. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
<term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3d6c901306..c07d6a4d95 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1557,6 +1557,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AppendReady</literal></entry>
+ <entry>Waiting for a subplan of Append to be ready.</entry>
+ </row>
<row>
<entry><literal>BackupWaitWalArchive</literal></entry>
<entry>Waiting for WAL files required for a backup to be successfully
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d797b5f53e..895c4d8a4c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1390,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1409,6 +1411,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (custom_name)
ExplainPropertyText("Custom Plan Provider", custom_name, es);
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
+ ExplainPropertyBool("Async Capable", plan->async_capable, es);
}
switch (nodeTag(plan))
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 0c10f1d35c..07f028bdf9 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -526,6 +526,10 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e69de29bb2..31a875f9a4 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,113 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+
+static void ExecAsyncResponse(AsyncRequest *areq);
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ */
+void
+ExecAsyncRequest(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor event
+ * for which it wishes to wait. We expect the node-type specific callback to
+ * make a single call of the following form:
+ *
+ * AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, areq);
+ */
+void
+ExecAsyncConfigureWait(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+void
+ExecAsyncNotify(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * A requestee node should call this function to deliver the tuple to its
+ * requestor node. The node can call this from its ExecAsyncRequest callback
+ * if the requested tuple is available immediately.
+ */
+void
+ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result)
+{
+ areq->request_complete = true;
+ areq->result = result;
+}
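The handshake that execAsync.c mediates boils down to three flag transitions on an AsyncRequest: complete immediately, park pending a file-descriptor event, or complete on notify. Here is a standalone sketch of that state machine; MiniRequest and the mini_* functions are invented stand-ins (and, unlike ExecAsyncRequestDone itself, mini_request_done also clears the pending flag for simplicity).

```c
#include <stdbool.h>

/* Hypothetical miniature of the AsyncRequest handshake: a requestee
 * either completes immediately or marks itself callback_pending and
 * completes later when its event fires. -1 stands in for a NULL slot. */
typedef struct MiniRequest
{
	bool		callback_pending;
	bool		request_complete;
	int			result;
} MiniRequest;

/* Mirrors ExecAsyncRequestDone(): deliver a result to the requestor. */
void
mini_request_done(MiniRequest *areq, int result)
{
	areq->callback_pending = false;
	areq->request_complete = true;
	areq->result = result;
}

/* Requestee callback: deliver immediately if a value is buffered,
 * otherwise ask to be called back when the fd becomes readable. */
void
mini_request(MiniRequest *areq, bool have_buffered, int value)
{
	if (have_buffered)
		mini_request_done(areq, value);
	else
	{
		areq->callback_pending = true;
		areq->request_complete = false;
	}
}

/* Event arrived: the waiter clears callback_pending before dispatching
 * (as ExecAppendAsyncEventWait does), then the request completes. */
void
mini_notify(MiniRequest *areq, int value)
{
	areq->callback_pending = false;
	mini_request_done(areq, value);
}
```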
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..0579499a54 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -57,10 +57,13 @@
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
/* Shared state for parallel-aware Append. */
struct ParallelAppendState
@@ -78,12 +81,17 @@ struct ParallelAppendState
};
#define INVALID_SUBPLAN_INDEX -1
+#define EVENT_BUFFER_SIZE 16
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
static void mark_invalid_subplans_as_finished(AppendState *node);
+static void ExecAppendAsyncBegin(AppendState *node);
+static bool ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result);
+static void ExecAppendAsyncEventWait(AppendState *node);
/* ----------------------------------------------------------------
* ExecInitAppend
@@ -102,7 +110,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
AppendState *appendstate = makeNode(AppendState);
PlanState **appendplanstates;
Bitmapset *validsubplans;
+ Bitmapset *asyncplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
@@ -119,6 +129,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_syncdone = false;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -191,12 +202,24 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* While at it, find out the first valid partial plan.
*/
j = 0;
+ asyncplans = NULL;
+ nasyncplans = 0;
firstvalid = nplans;
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record async subplans. When executing EvalPlanQual, we process
+ * async subplans synchronously, so don't do this in that case.
+ */
+ if (initNode->async_capable && estate->es_epq_active == NULL)
+ {
+ asyncplans = bms_add_member(asyncplans, j);
+ nasyncplans++;
+ }
+
/*
* Record the lowest appendplans index which is a valid partial plan.
*/
@@ -210,6 +233,38 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* Initialize async state */
+ appendstate->as_asyncplans = asyncplans;
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_asyncrequests = NULL;
+ appendstate->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ appendstate->as_nasyncremain = nasyncplans;
+ appendstate->as_needrequest = NULL;
+ appendstate->as_eventset = NULL;
+
+ if (nasyncplans > 0)
+ {
+ appendstate->as_asyncrequests = (AsyncRequest **)
+ palloc0(nplans * sizeof(AsyncRequest *));
+
+ i = -1;
+ while ((i = bms_next_member(asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq;
+
+ areq = palloc(sizeof(AsyncRequest));
+ areq->requestor = (PlanState *) appendstate;
+ areq->requestee = appendplanstates[i];
+ areq->request_index = i;
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+
+ appendstate->as_asyncrequests[i] = areq;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -232,31 +287,45 @@ static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
+ TupleTableSlot *result;
- if (node->as_whichplan < 0)
+ if (!node->as_syncdone && node->as_whichplan == INVALID_SUBPLAN_INDEX)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ /* If there are any async subplans, begin execution of them */
+ if (node->as_nasyncplans > 0)
+ ExecAppendAsyncBegin(node);
+
/*
- * If no subplan has been chosen, we must choose one before
+ * If no sync subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
- !node->choose_next_subplan(node))
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
{
PlanState *subnode;
- TupleTableSlot *result;
CHECK_FOR_INTERRUPTS();
/*
- * figure out which subplan we are currently processing
+ * try to get a tuple from any of the async subplans
+ */
+ if (!bms_is_empty(node->as_needrequest) ||
+ (node->as_syncdone && node->as_nasyncremain > 0))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
+ Assert(bms_is_empty(node->as_needrequest));
+ }
+
+ /*
+ * figure out which sync subplan we are currently processing
*/
Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
subnode = node->appendplans[node->as_whichplan];
@@ -276,8 +345,16 @@ ExecAppend(PlanState *pstate)
return result;
}
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /* wait or poll async events */
+ if (node->as_nasyncremain > 0)
+ {
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ ExecAppendAsyncEventWait(node);
+ }
+
+ /* choose new sync subplan; if no sync/async subplans, we're done */
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -313,6 +390,7 @@ ExecEndAppend(AppendState *node)
void
ExecReScanAppend(AppendState *node)
{
+ int nasyncplans = node->as_nasyncplans;
int i;
/*
@@ -347,8 +425,27 @@ ExecReScanAppend(AppendState *node)
ExecReScan(subnode);
}
+ /* Reset async state */
+ node->as_nasyncremain = nasyncplans;
+ bms_free(node->as_needrequest);
+ node->as_needrequest = NULL;
+
+ if (nasyncplans > 0)
+ {
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+ }
+ }
+
/* Let choose_next_subplan_* function handle setting the first subplan */
node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_syncdone = false;
}
/* ----------------------------------------------------------------
@@ -429,7 +526,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
/* ----------------------------------------------------------------
* choose_next_subplan_locally
*
- * Choose next subplan for a non-parallel-aware Append,
+ * Choose next sync subplan for a non-parallel-aware Append,
* returning false if there are no more.
* ----------------------------------------------------------------
*/
@@ -444,9 +541,9 @@ choose_next_subplan_locally(AppendState *node)
/*
* If first call then have the bms member function choose the first valid
- * subplan by initializing whichplan to -1. If there happen to be no
- * valid subplans then the bms member function will handle that by
- * returning a negative number which will allow us to exit returning a
+ * sync subplan by initializing whichplan to -1. If there happen to be
+ * no valid sync subplans then the bms member function will handle that
+ * by returning a negative number which will allow us to exit returning a
* false value.
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
@@ -467,7 +564,10 @@ choose_next_subplan_locally(AppendState *node)
nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
if (nextplan < 0)
+ {
+ node->as_syncdone = true;
return false;
+ }
node->as_whichplan = nextplan;
@@ -709,3 +809,262 @@ mark_invalid_subplans_as_finished(AppendState *node)
node->as_pstate->pa_finished[i] = true;
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncBegin
+ *
+ * Begin execution of designated async-capable subplans.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncBegin(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+ int i;
+
+ /* We should never be called when there are no async subplans. */
+ Assert(node->as_nasyncplans > 0);
+
+ if (node->as_valid_subplans == NULL)
+ {
+ node->as_valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+ }
+
+ /* Nothing to do if there are no valid async subplans */
+ if (!bms_overlap(node->as_valid_subplans, node->as_asyncplans))
+ return;
+
+ /* Get valid async subplans. */
+ valid_asyncplans = bms_copy(node->as_asyncplans);
+ valid_asyncplans = bms_int_members(valid_asyncplans,
+ node->as_valid_subplans);
+
+ /* Adjust the node's as_valid_subplans to only contain sync subplans. */
+ node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
+ valid_asyncplans);
+
+ /* Initially, all async subplans need a request. */
+ i = -1;
+ while ((i = bms_next_member(valid_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(areq->request_index == i);
+ Assert(!areq->callback_pending);
+
+ /* Make a new request. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(valid_asyncplans);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncGetNext
+ *
+ * Get the next tuple from any of the asynchronous subplans.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
+{
+ *result = NULL;
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ while (node->as_nasyncremain > 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Wait or poll async events. */
+ ExecAppendAsyncEventWait(node);
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ /* Break from loop if there is any sync node that is not complete */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If all sync subplans are complete, we're totally done scanning the
+ * given node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the sync subplans.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncremain == 0);
+ *result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncRequest
+ *
+ * If there are any asynchronous subplans that need a new asynchronous
+ * request, make all of them.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
+{
+ Bitmapset *needrequest;
+ int i;
+
+ /* Nothing to do if there are no async subplans needing a new request. */
+ if (bms_is_empty(node->as_needrequest))
+ return false;
+
+ /*
+ * If there are any asynchronously-generated results that have not yet
+ * been returned, we have nothing to do; just return one of them.
+ */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ /* Make a new request for each subplan needing it. */
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ i = -1;
+ while ((i = bms_next_member(needrequest, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ /* Make a new request. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(needrequest);
+
+ /* Return one of the asynchronously-generated results if any. */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncEventWait
+ *
+ * Wait or poll for file descriptor wait events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncEventWait(AppendState *node)
+{
+ long timeout = node->as_syncdone ? -1 : 0;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+
+ /* Nothing to do if there are no async remaining subplans. */
+ if (node->as_nasyncremain == 0)
+ return;
+
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
+ node->as_nasyncplans + 1);
+ AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+ NULL, NULL);
+
+ /* Give each waiting subplan a chance to add an event. */
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ if (areq->callback_pending)
+ ExecAsyncConfigureWait(areq);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+ EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ FreeWaitEventSet(node->as_eventset);
+ node->as_eventset = NULL;
+ if (noccurred == 0)
+ return;
+
+ /* Deliver notifications. */
+ for (i = 0; i < noccurred; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ /*
+ * Each waiting subplan should have registered its wait event with
+ * user_data pointing back to its AsyncRequest.
+ */
+ if ((w->events & WL_SOCKET_READABLE) != 0)
+ {
+ AsyncRequest *areq = (AsyncRequest *) w->user_data;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ Assert(areq->callback_pending);
+ areq->callback_pending = false;
+
+ /* Do the actual work: ExecAsyncNotify would call the callback. */
+ ExecAsyncNotify(areq);
+ }
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(AsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot = areq->result;
+
+ /* The result should be a TupleTableSlot or NULL. */
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Nothing to do if the request is pending. */
+ if (!areq->request_complete)
+ {
+ /* The request would need a callback. */
+ Assert(areq->callback_pending);
+ return;
+ }
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ {
+ /* The ending subplan would no longer need a callback. */
+ Assert(!areq->callback_pending);
+ --node->as_nasyncremain;
+ return;
+ }
+
+ /* Save result so we can return it */
+ Assert(node->as_nasyncresults < node->as_nasyncplans);
+ node->as_asyncresults[node->as_nasyncresults++] = slot;
+
+ /*
+ * Mark the subplan that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0b20f94035..aacd3464ce 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -391,3 +391,51 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Asynchronously request a tuple from a designated async-capable node
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Callback invoked when a relevant event has occurred
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 70f8b718e0..ea8f0ecfed 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -120,6 +120,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -241,6 +242,7 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d78b16ed1d..d8a9ec5be1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -333,6 +333,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -431,6 +432,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 0f6a77afc4..56638a0437 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1574,6 +1574,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1670,6 +1671,7 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 22d6935824..97f28227cb 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -147,6 +147,7 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
+bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f7a8dae3c6..ba0624b48d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,6 +81,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
+static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1066,6 +1067,30 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
+
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1083,6 +1108,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1090,6 +1116,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1153,6 +1180,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
+ /* If appropriate, consider async append */
+ consider_async = (enable_async_append && pathkeys == NIL &&
+ !best_path->path.parallel_safe &&
+ list_length(best_path->subpaths) > 1);
+
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1220,6 +1252,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (consider_async && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ ++nasyncplans;
+ }
}
/*
@@ -1254,6 +1293,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
+ plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 123369f4fa..d438e4cd17 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3919,6 +3919,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
switch (w)
{
+ case WAIT_EVENT_APPEND_READY:
+ event_name = "AppendReady";
+ break;
case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
event_name = "BackupWaitWalArchive";
break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 878fcc2236..a4d4b2027a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1108,6 +1108,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of async append plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_async_append,
+ true,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5298e18ecd..e5415772be 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -370,6 +370,7 @@
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on
+#enable_async_append = on
# - Planner Cost Constants -
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
index e69de29bb2..c28199e92a 100644
--- a/src/include/executor/execAsync.h
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(AsyncRequest *areq);
+extern void ExecAsyncConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncNotify(AsyncRequest *areq);
+extern void ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index be222ebff6..3d36096304 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -25,4 +25,6 @@ extern void ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt);
+extern void ExecAsyncAppendResponse(AsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..abd782a6f3 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,4 +31,8 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(AsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncForeignScanNotify(AsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..03cdfa12c1 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -170,6 +170,14 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+
+typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncConfigureWait_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncNotify_function) (AsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -246,6 +254,12 @@ typedef struct FdwRoutine
/* Support functions for path reparameterization. */
ReparameterizeForeignPathByChild_function ReparameterizeForeignPathByChild;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 61ba4c3666..6e2db12895 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -502,6 +502,22 @@ typedef struct ResultRelInfo
struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
+/* ----------------
+ * AsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct AsyncRequest
+{
+ struct PlanState *requestor; /* Node that wants a tuple */
+ struct PlanState *requestee; /* Node from which a tuple is wanted */
+ int request_index; /* Scratch space for requestor */
+ bool callback_pending; /* Callback is needed */
+ bool request_complete; /* Request complete, result valid */
+ TupleTableSlot *result; /* Result (NULL if no more tuples) */
+} AsyncRequest;
+
/* ----------------
* EState information
*
@@ -1207,6 +1223,16 @@ struct AppendState
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
int as_whichplan;
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_asyncplans; /* asynchronous plans indexes */
+ int as_nasyncplans; /* # of asynchronous plans */
+ AsyncRequest **as_asyncrequests; /* array of AsyncRequests */
+ TupleTableSlot **as_asyncresults; /* unreturned results of async plans */
+ int as_nasyncresults; /* # of valid entries in as_asyncresults */
+ int as_nasyncremain; /* # of remaining async plans */
+ Bitmapset *as_needrequest; /* async plans ready for a request */
+ struct WaitEventSet *as_eventset; /* WaitEventSet used to configure
+ * file descriptor wait events */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
ParallelAppendState *as_pstate; /* parallel coordination info */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7e6b10f86b..6c5396e6a3 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -129,6 +129,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -245,6 +250,7 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
+ int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 8e621d2f76..33bc133dd4 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -65,6 +65,7 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
+extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5954068dec..3249570a18 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -923,6 +923,7 @@ typedef enum
*/
typedef enum
{
+ WAIT_EVENT_APPEND_READY,
WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE = PG_WAIT_IPC,
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index dc7ab2ce8b..e78ca7bddb 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -87,6 +87,7 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -136,6 +137,7 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Plan> +
<Node-Type>Seq Scan</Node-Type> +
<Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
<Relation-Name>int8_tbl</Relation-Name> +
<Alias>i8</Alias> +
<Startup-Cost>N.N</Startup-Cost> +
@@ -183,6 +185,7 @@ select explain_filter('explain (analyze, buffers, format yaml) select * from int
- Plan: +
Node Type: "Seq Scan" +
Parallel Aware: false +
+ Async Capable: false +
Relation Name: "int8_tbl"+
Alias: "i8" +
Startup Cost: N.N +
@@ -233,6 +236,7 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -348,6 +352,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Relation Name": "tenk1", +
"Parallel Aware": true, +
"Local Hit Blocks": 0, +
@@ -393,6 +398,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Sort Space Used": 0, +
"Local Hit Blocks": 0, +
@@ -435,6 +441,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Workers Planned": 0, +
"Local Hit Blocks": 0, +
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index a8cbfd9f5f..af38f3b93c 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -558,6 +558,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 55, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
@@ -734,6 +735,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 70, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index ff157ceb1c..499245068a 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -204,6 +204,7 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "insertconflicttest", +
"Alias": "insertconflicttest", +
"Conflict Resolution": "UPDATE", +
@@ -213,7 +214,8 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
{ +
"Node Type": "Result", +
"Parent Relationship": "Member", +
- "Parallel Aware": false +
+ "Parallel Aware": false, +
+ "Async Capable": false +
} +
] +
} +
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 81bdacf59d..b7818c0637 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -88,6 +88,7 @@ select count(*) = 1 as ok from pg_stat_wal;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
@@ -106,7 +107,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
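As a reference for what an FDW author would implement against the new `FdwRoutine` hooks above, the dispatch pattern used by `ExecAsyncForeignScanRequest` can be sketched as below. This is a hedged, stand-alone illustration: the stripped-down `AsyncRequest`/`FdwRoutine` structs and the `toy_async_request`/`dispatch_async_request` names are stand-ins for this sketch, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the patch's AsyncRequest (fields reduced). */
typedef struct AsyncRequest
{
    int request_index;      /* scratch space for the requestor */
    int callback_pending;   /* does the requestee owe us a callback? */
    int request_complete;   /* is the result valid? */
} AsyncRequest;

typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);

/* Minimal stand-in for FdwRoutine: only the async-request hook. */
typedef struct FdwRoutine
{
    /* ... the many synchronous callbacks are elided here ... */
    ForeignAsyncRequest_function ForeignAsyncRequest;
} FdwRoutine;

/* A toy FDW callback that completes the request immediately. */
static void
toy_async_request(AsyncRequest *areq)
{
    areq->callback_pending = 0;
    areq->request_complete = 1;
}

/*
 * Mirrors ExecAsyncForeignScanRequest: assert the hook was provided,
 * then forward the request to the FDW.
 */
static void
dispatch_async_request(FdwRoutine *routine, AsyncRequest *areq)
{
    assert(routine->ForeignAsyncRequest != NULL);
    routine->ForeignAsyncRequest(areq);
}
```

The same assert-then-forward shape applies to `ForeignAsyncConfigureWait` and `ForeignAsyncNotify`; an FDW that returns false (or NULL) for `IsForeignPathAsyncCapable` is simply never marked `async_capable` by the planner.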
On Thu, Dec 31, 2020 at 7:15 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
* I tweaked comments a bit to address your comments.
I forgot to update some comments. :-( Attached is a new version of
the patch updating comments further. I did a bit of cleanup for the
postgres_fdw part as well.
Best regards,
Etsuro Fujita
Attachments:
async-wip-2021-01-01.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index d841cec39b..b5b6d30c39 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -59,6 +59,7 @@ typedef struct ConnCacheEntry
bool invalidated; /* true if reconnect is pending */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ PgFdwConnState state; /* extra per-connection state */
} ConnCacheEntry;
/*
@@ -106,7 +107,7 @@ static bool UserMappingPasswordRequired(UserMapping *user);
* (not even on error), we need this flag to cue manual cleanup.
*/
PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
{
bool found;
bool retry = false;
@@ -253,6 +254,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Remember if caller will prepare statements */
entry->have_prep_stmt |= will_prep_stmt;
+ /* If caller needs access to the per-connection state, return it. */
+ if (state)
+ *state = &entry->state;
+
return entry->conn;
}
@@ -279,6 +284,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ memset(&entry->state, 0, sizeof(entry->state));
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c11092f8cc..ac931e56e9 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6986,7 +6986,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7014,7 +7014,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7042,7 +7042,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7070,7 +7070,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7098,7 +7098,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+----
(0 rows)
@@ -7140,35 +7140,40 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7178,35 +7183,40 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7238,7 +7248,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-> Hash Join
@@ -7256,7 +7266,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
(39 rows)
@@ -7274,6 +7284,7 @@ select tableoid::regclass, * from bar order by 1,2;
(6 rows)
-- Check UPDATE with inherited target and an appendrel subquery
+SET enable_async_append TO false;
explain (verbose, costs off)
update bar set f2 = f2 + 100
from
@@ -7332,6 +7343,7 @@ update bar set f2 = f2 + 100
from
( select f1 from foo union all select f1+3 from foo ) ss
where bar.f1 = ss.f1;
+RESET enable_async_append;
select tableoid::regclass, * from bar order by 1,2;
tableoid | f1 | f2
----------+----+-----
@@ -8571,9 +8583,9 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
(7 rows)
@@ -8610,19 +8622,19 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
(11 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
@@ -8652,9 +8664,9 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
(7 rows)
@@ -8707,6 +8719,7 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
(14 rows)
-- test FOR UPDATE; partitionwise join does not apply
+SET enable_async_append TO false;
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
QUERY PLAN
@@ -8734,6 +8747,7 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
400 | 400
(4 rows)
+RESET enable_async_append;
RESET enable_partitionwise_join;
-- ===================================================================
-- test partitionwise aggregates
@@ -8758,17 +8772,17 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
(9 rows)
-- Plan with partitionwise aggregates is enabled
@@ -8780,11 +8794,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
(9 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index b6c72e1d1e..de01cba130 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -37,6 +38,7 @@
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
#include "postgres_fdw.h"
+#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/float.h"
#include "utils/guc.h"
@@ -155,6 +157,11 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ /* for asynchronous execution */
+ bool async_capable; /* engage asynchronous-capable logic? */
+ PgFdwConnState *conn_state; /* extra per-connection state */
+ ForeignScanState *next_node; /* next ForeignScan node to activate */
+
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -392,6 +399,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(AsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(AsyncRequest *areq);
+static void postgresForeignAsyncNotify(AsyncRequest *areq);
/*
* Helper functions
@@ -420,6 +431,7 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
+static void fetch_more_data_begin(ForeignScanState *node);
static void fetch_more_data(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
@@ -471,6 +483,7 @@ static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
double *totaldeadrows);
static void analyze_row_processor(PGresult *res, int row,
PgFdwAnalyzeState *astate);
+static void request_tuple_asynchronously(AsyncRequest *areq);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -560,6 +573,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for asynchronous execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1435,7 +1454,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
/* Assign a unique ID for my cursor */
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1486,6 +1505,12 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
+
+ /* Initialize async state */
+ fsstate->async_capable = node->ss.ps.plan->async_capable;
+ fsstate->conn_state->activated = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
}
/*
@@ -1500,8 +1525,10 @@ postgresIterateForeignScan(ForeignScanState *node)
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
+ * In sync mode, if this is the first call after Begin or ReScan, we need
+ * to create the cursor on the remote side. In async mode, we would have
+ * already created the cursor before we get here, even if this is the first
+ * call after Begin or ReScan.
*/
if (!fsstate->cursor_exists)
create_cursor(node);
@@ -1511,6 +1538,9 @@ postgresIterateForeignScan(ForeignScanState *node)
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
+ /* In async mode, just clear tuple slot. */
+ if (fsstate->async_capable)
+ return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
fetch_more_data(node);
@@ -1540,6 +1570,14 @@ postgresReScanForeignScan(ForeignScanState *node)
char sql[64];
PGresult *res;
+ /* Reset async state */
+ if (fsstate->async_capable)
+ {
+ fsstate->conn_state->activated = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
+ }
+
/* If we haven't created the cursor yet, nothing to do. */
if (!fsstate->cursor_exists)
return;
@@ -1598,6 +1636,14 @@ postgresEndForeignScan(ForeignScanState *node)
if (fsstate == NULL)
return;
+ /*
+ * If we're ending before we've collected a response from an asynchronous
+ * query, we have to consume the response.
+ */
+ if (fsstate->conn_state->activated == node &&
+ fsstate->conn_state->async_query_sent)
+ fetch_more_data(node);
+
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
close_cursor(fsstate->conn, fsstate->cursor_number);
@@ -2374,7 +2420,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->conn = GetConnection(user, false, NULL);
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2748,7 +2794,7 @@ estimate_path_cost_size(PlannerInfo *root,
false, &retrieved_attrs, NULL);
/* Get the remote estimate */
- conn = GetConnection(fpinfo->user, false);
+ conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3376,6 +3422,34 @@ create_cursor(ForeignScanState *node)
pfree(buf.data);
}
+/*
+ * Begin an asynchronous data fetch.
+ * fetch_more_data must be called later to collect the results.
+ */
+static void
+fetch_more_data_begin(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PGconn *conn = fsstate->conn;
+ char sql[64];
+
+ Assert(fsstate->conn_state->activated == node);
+ Assert(!fsstate->conn_state->async_query_sent);
+
+ /* Create the cursor synchronously. */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ /* We will send this query, but not wait for the response. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, fsstate->query);
+
+ fsstate->conn_state->async_query_sent = true;
+}
+
/*
* Fetch some more rows from the node's cursor.
*/
@@ -3398,17 +3472,36 @@ fetch_more_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->conn;
- char sql[64];
int numrows;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
+ if (fsstate->async_capable)
+ {
+ Assert(fsstate->conn_state->activated == node);
+ Assert(fsstate->conn_state->async_query_sent);
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
- pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ /*
+ * The query was already sent by an earlier call to
+ * fetch_more_data_begin. So now we just fetch the result.
+ */
+ res = PQgetResult(conn);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
+ else
+ {
+ char sql[64];
+
+ /* This is a regular synchronous fetch. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ res = pgfdw_exec_query(conn, sql);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
/* Convert the data into HeapTuples */
numrows = PQntuples(res);
@@ -3435,6 +3528,15 @@ fetch_more_data(ForeignScanState *node)
/* Must be EOF if we didn't get as many tuples as we asked for. */
fsstate->eof_reached = (numrows < fsstate->fetch_size);
+
+ /* If this was the second part of an async request, we must fetch until NULL. */
+ if (fsstate->async_capable)
+ {
+ /* XXX: call PQgetResult just once and error out if not NULL as expected? */
+ while (PQgetResult(conn) != NULL)
+ ;
+ fsstate->conn_state->async_query_sent = false;
+ }
}
PG_FINALLY();
{
@@ -3559,7 +3661,7 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->conn = GetConnection(user, true, NULL);
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -4434,7 +4536,7 @@ postgresAnalyzeForeignTable(Relation relation,
*/
table = GetForeignTable(RelationGetRelid(relation));
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct command to get page count for relation.
@@ -4520,7 +4622,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
table = GetForeignTable(RelationGetRelid(relation));
server = GetForeignServer(table->serverid);
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct cursor that retrieves whole rows from remote.
@@ -4748,7 +4850,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
*/
server = GetForeignServer(serverOid);
mapping = GetUserMapping(GetUserId(), server->serverid);
- conn = GetConnection(mapping, false);
+ conn = GetConnection(mapping, false, NULL);
/* Don't attempt to import collation if remote server hasn't got it */
if (PQserverVersion(conn) < 90100)
@@ -6294,6 +6396,177 @@ add_foreign_final_paths(PlannerInfo *root, RelOptInfo *input_rel,
add_path(final_rel, (Path *) final_path);
}
+/*
+ * postgresIsForeignPathAsyncCapable
+ * Check whether a given ForeignPath node is async-capable.
+ */
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * postgresForeignAsyncRequest
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+
+ /*
+ * If this is the first call after Begin or ReScan, mark the connection
+ * as used by the ForeignScan node.
+ */
+ if (fsstate->conn_state->activated == NULL)
+ fsstate->conn_state->activated = node;
+
+ /*
+ * If the connection has already been used by a ForeignScan node, put it
+ * at the end of the chain of waiting ForeignScan nodes, and then return.
+ */
+ if (node != fsstate->conn_state->activated)
+ {
+ ForeignScanState *curr_node = fsstate->conn_state->activated;
+ PgFdwScanState *curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+
+ /* Scan down the chain ... */
+ while (curr_fsstate->next_node)
+ {
+ curr_node = curr_fsstate->next_node;
+ Assert(node != curr_node);
+ curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+ }
+ /* Update the chain linking */
+ curr_fsstate->next_node = node;
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+
+ request_tuple_asynchronously(areq);
+}
+
+/*
+ * postgresForeignAsyncConfigureWait
+ * Configure a file descriptor event for which we wish to wait.
+ */
+static void
+postgresForeignAsyncConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AppendState *requestor = (AppendState *) areq->requestor;
+ WaitEventSet *set = requestor->as_eventset;
+
+ /* This function should not be called unless callback_pending */
+ Assert(areq->callback_pending);
+
+ /* If the ForeignScan node isn't activated yet, nothing to do */
+ if (fsstate->conn_state->activated != node)
+ return;
+
+ AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
+ NULL, areq);
+}
+
+/*
+ * postgresForeignAsyncNotify
+ * Fetch some more tuples from a file descriptor that has become ready,
+ * then request the next tuple.
+ */
+static void
+postgresForeignAsyncNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+
+ /* The core code would have initialized the callback_pending flag */
+ Assert(!areq->callback_pending);
+
+ fetch_more_data(node);
+
+ request_tuple_asynchronously(areq);
+}
+
+/*
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+request_tuple_asynchronously(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ TupleTableSlot *result;
+
+ /* Request some more tuples, if we've run out */
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ /* No point in another fetch if we already detected EOF, though */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+ fsstate->conn_state->activated = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->activated = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin an asynchronous fetch for that node */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ return;
+ }
+
+ /* Get a tuple from the ForeignScan node */
+ result = ExecProcNode((PlanState *) node);
+
+ if (TupIsNull(result))
+ {
+ Assert(fsstate->next_tuple >= fsstate->num_tuples);
+
+ /* Request some more tuples, if we've not detected EOF yet */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+ fsstate->conn_state->activated = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->activated = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin an asynchronous fetch for that node */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+ }
+
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..15c9750f8b 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,6 +16,7 @@
#include "foreign/foreign.h"
#include "lib/stringinfo.h"
#include "libpq-fe.h"
+#include "nodes/execnodes.h"
#include "nodes/pathnodes.h"
#include "utils/relcache.h"
@@ -124,12 +125,22 @@ typedef struct PgFdwRelationInfo
int relation_index;
} PgFdwRelationInfo;
+/*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+ ForeignScanState *activated; /* currently-activated ForeignScan node */
+ bool async_query_sent; /* has an asynchronous query been sent? */
+} PgFdwConnState;
+
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+ PgFdwConnState **state);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 25dbc08b98..ad7bf90bb0 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1799,31 +1799,31 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1859,12 +1859,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1874,6 +1874,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
select tableoid::regclass, * from bar order by 1,2;
-- Check UPDATE with inherited target and an appendrel subquery
+SET enable_async_append TO false;
explain (verbose, costs off)
update bar set f2 = f2 + 100
from
@@ -1883,6 +1884,7 @@ update bar set f2 = f2 + 100
from
( select f1 from foo union all select f1+3 from foo ) ss
where bar.f1 = ss.f1;
+RESET enable_async_append;
select tableoid::regclass, * from bar order by 1,2;
@@ -2492,9 +2494,11 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE a % 25 = 0) t1 FULL JOIN (SELECT 't2_phv' phv, * FROM fprt2 WHERE b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY t1.a, t2.b;
-- test FOR UPDATE; partitionwise join does not apply
+SET enable_async_append TO false;
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
+RESET enable_async_append;
RESET enable_partitionwise_join;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 67de4150b8..1427ffc82b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4733,6 +4733,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<variablelist>
+ <varlistentry id="guc-enable-async-append" xreflabel="enable_async_append">
+ <term><varname>enable_async_append</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_async_append</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of async-aware
+ append plan types. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
<term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3d6c901306..c07d6a4d95 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1557,6 +1557,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AppendReady</literal></entry>
+ <entry>Waiting for a subplan of Append to be ready.</entry>
+ </row>
<row>
<entry><literal>BackupWaitWalArchive</literal></entry>
<entry>Waiting for WAL files required for a backup to be successfully
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d797b5f53e..895c4d8a4c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1390,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1409,6 +1411,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (custom_name)
ExplainPropertyText("Custom Plan Provider", custom_name, es);
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
+ ExplainPropertyBool("Async Capable", plan->async_capable, es);
}
switch (nodeTag(plan))
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 0c10f1d35c..07f028bdf9 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -526,6 +526,10 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e69de29bb2..6174ea1eb6 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,113 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+
+static void ExecAsyncResponse(AsyncRequest *areq);
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ */
+void
+ExecAsyncRequest(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor event
+ * for which it wishes to wait. We expect the node-type specific callback to
+ * make a single call of the following form:
+ *
+ * AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, areq);
+ */
+void
+ExecAsyncConfigureWait(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+void
+ExecAsyncNotify(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * A requestee node should call this function to deliver the tuple to its
+ * requestor node. The node can call this from its ExecAsyncRequest callback
+ * if the requested tuple is available immediately.
+ */
+void
+ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result)
+{
+ areq->request_complete = true;
+ areq->result = result;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..2120481743 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -57,10 +57,13 @@
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
/* Shared state for parallel-aware Append. */
struct ParallelAppendState
@@ -78,12 +81,17 @@ struct ParallelAppendState
};
#define INVALID_SUBPLAN_INDEX -1
+#define EVENT_BUFFER_SIZE 16
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
static void mark_invalid_subplans_as_finished(AppendState *node);
+static void ExecAppendAsyncBegin(AppendState *node);
+static bool ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result);
+static void ExecAppendAsyncEventWait(AppendState *node);
/* ----------------------------------------------------------------
* ExecInitAppend
@@ -102,7 +110,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
AppendState *appendstate = makeNode(AppendState);
PlanState **appendplanstates;
Bitmapset *validsubplans;
+ Bitmapset *asyncplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
@@ -119,6 +129,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_syncdone = false;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -191,12 +202,24 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* While at it, find out the first valid partial plan.
*/
j = 0;
+ asyncplans = NULL;
+ nasyncplans = 0;
firstvalid = nplans;
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record async subplans. When executing EvalPlanQual, we process
+ * async subplans synchronously, so don't do this in that case.
+ */
+ if (initNode->async_capable && estate->es_epq_active == NULL)
+ {
+ asyncplans = bms_add_member(asyncplans, j);
+ nasyncplans++;
+ }
+
/*
* Record the lowest appendplans index which is a valid partial plan.
*/
@@ -210,6 +233,38 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* Initialize async state */
+ appendstate->as_asyncplans = asyncplans;
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_asyncrequests = NULL;
+ appendstate->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ appendstate->as_nasyncremain = nasyncplans;
+ appendstate->as_needrequest = NULL;
+ appendstate->as_eventset = NULL;
+
+ if (nasyncplans > 0)
+ {
+ appendstate->as_asyncrequests = (AsyncRequest **)
+ palloc0(nplans * sizeof(AsyncRequest *));
+
+ i = -1;
+ while ((i = bms_next_member(asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq;
+
+ areq = palloc(sizeof(AsyncRequest));
+ areq->requestor = (PlanState *) appendstate;
+ areq->requestee = appendplanstates[i];
+ areq->request_index = i;
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+
+ appendstate->as_asyncrequests[i] = areq;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -232,31 +287,45 @@ static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
+ TupleTableSlot *result;
- if (node->as_whichplan < 0)
+ if (!node->as_syncdone && node->as_whichplan == INVALID_SUBPLAN_INDEX)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ /* If there are any async subplans, begin execution of them */
+ if (node->as_nasyncplans > 0)
+ ExecAppendAsyncBegin(node);
+
/*
- * If no subplan has been chosen, we must choose one before
+ * If no sync subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
- !node->choose_next_subplan(node))
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
{
PlanState *subnode;
- TupleTableSlot *result;
CHECK_FOR_INTERRUPTS();
/*
- * figure out which subplan we are currently processing
+ * try to get a tuple from any of the async subplans
+ */
+ if (!bms_is_empty(node->as_needrequest) ||
+ (node->as_syncdone && node->as_nasyncremain > 0))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
+ Assert(bms_is_empty(node->as_needrequest));
+ }
+
+ /*
+ * figure out which sync subplan we are currently processing
*/
Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
subnode = node->appendplans[node->as_whichplan];
@@ -276,8 +345,16 @@ ExecAppend(PlanState *pstate)
return result;
}
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /* wait or poll async events */
+ if (node->as_nasyncremain > 0)
+ {
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ ExecAppendAsyncEventWait(node);
+ }
+
+ /* choose new sync subplan; if no sync/async subplans, we're done */
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -313,6 +390,7 @@ ExecEndAppend(AppendState *node)
void
ExecReScanAppend(AppendState *node)
{
+ int nasyncplans = node->as_nasyncplans;
int i;
/*
@@ -347,8 +425,27 @@ ExecReScanAppend(AppendState *node)
ExecReScan(subnode);
}
+ /* Reset async state */
+ node->as_nasyncremain = nasyncplans;
+ bms_free(node->as_needrequest);
+ node->as_needrequest = NULL;
+
+ if (nasyncplans > 0)
+ {
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+ }
+ }
+
/* Let choose_next_subplan_* function handle setting the first subplan */
node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_syncdone = false;
}
/* ----------------------------------------------------------------
@@ -429,7 +526,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
/* ----------------------------------------------------------------
* choose_next_subplan_locally
*
- * Choose next subplan for a non-parallel-aware Append,
+ * Choose next sync subplan for a non-parallel-aware Append,
* returning false if there are no more.
* ----------------------------------------------------------------
*/
@@ -444,9 +541,9 @@ choose_next_subplan_locally(AppendState *node)
/*
* If first call then have the bms member function choose the first valid
- * subplan by initializing whichplan to -1. If there happen to be no
- * valid subplans then the bms member function will handle that by
- * returning a negative number which will allow us to exit returning a
+ * sync subplan by initializing whichplan to -1. If there happen to be
+ * no valid sync subplans then the bms member function will handle that
+ * by returning a negative number which will allow us to exit returning a
* false value.
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
@@ -467,7 +564,10 @@ choose_next_subplan_locally(AppendState *node)
nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
if (nextplan < 0)
+ {
+ node->as_syncdone = true;
return false;
+ }
node->as_whichplan = nextplan;
@@ -709,3 +809,265 @@ mark_invalid_subplans_as_finished(AppendState *node)
node->as_pstate->pa_finished[i] = true;
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncBegin
+ *
+ * Begin execution of designated async-capable subplans.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncBegin(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+ int i;
+
+ /* We should never be called when there are no async subplans. */
+ Assert(node->as_nasyncplans > 0);
+
+ if (node->as_valid_subplans == NULL)
+ {
+ node->as_valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+ }
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (!bms_overlap(node->as_valid_subplans, node->as_asyncplans))
+ return;
+
+ /* Get valid async subplans. */
+ valid_asyncplans = bms_copy(node->as_asyncplans);
+ valid_asyncplans = bms_int_members(valid_asyncplans,
+ node->as_valid_subplans);
+
+ /* Adjust the node's as_valid_subplans to only contain sync subplans. */
+ node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
+ valid_asyncplans);
+
+ /* Make a request for each of the async subplans. */
+ i = -1;
+ while ((i = bms_next_member(valid_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(areq->request_index == i);
+ Assert(!areq->callback_pending);
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(valid_asyncplans);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncGetNext
+ *
+ * Get the next tuple from any of the asynchronous subplans.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
+{
+ *result = NULL;
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ while (node->as_nasyncremain > 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Wait or poll async events. */
+ ExecAppendAsyncEventWait(node);
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ /* Break from loop if there is any sync node that is not complete */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If all sync subplans are complete, we're totally done scanning the
+ * given node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the sync subplans.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncremain == 0);
+ *result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncRequest
+ *
+ * If there are any asynchronous subplans that need a new asynchronous
+ * request, make all of them.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
+{
+ Bitmapset *needrequest;
+ int i;
+
+ /* Nothing to do if there are no async subplans needing a new request. */
+ if (bms_is_empty(node->as_needrequest))
+ return false;
+
+ /*
+ * If there are any asynchronously-generated results that have not yet
+ * been returned, we have nothing to do; just return one of them.
+ */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ /* Make a new request for each of the async subplans that need it. */
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ i = -1;
+ while ((i = bms_next_member(needrequest, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(needrequest);
+
+ /* Return one of the asynchronously-generated results if any. */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncEventWait
+ *
+ * Wait or poll for file descriptor wait events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncEventWait(AppendState *node)
+{
+ long timeout = node->as_syncdone ? -1 : 0;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+
+ /* Nothing to do if there are no async remaining subplans. */
+ if (node->as_nasyncremain == 0)
+ return;
+
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
+ node->as_nasyncplans + 1);
+ AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+ NULL, NULL);
+
+ /* Give each waiting subplan a chance to add an event. */
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ if (areq->callback_pending)
+ ExecAsyncConfigureWait(areq);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+ EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ FreeWaitEventSet(node->as_eventset);
+ node->as_eventset = NULL;
+ if (noccurred == 0)
+ return;
+
+ /* Deliver notifications. */
+ for (i = 0; i < noccurred; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ /*
+ * Each waiting subplan should have registered its wait event with
+ * user_data pointing back to its AsyncRequest.
+ */
+ if ((w->events & WL_SOCKET_READABLE) != 0)
+ {
+ AsyncRequest *areq = (AsyncRequest *) w->user_data;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ Assert(areq->callback_pending);
+ areq->callback_pending = false;
+
+ /* Do the actual work. */
+ ExecAsyncNotify(areq);
+ }
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(AsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot = areq->result;
+
+ /* The result should be a TupleTableSlot or NULL. */
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Nothing to do if the request is pending. */
+ if (!areq->request_complete)
+ {
+ /*
+ * The subplan for which the request was made would be pending for a
+ * callback.
+ */
+ Assert(areq->callback_pending);
+ return;
+ }
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ {
+ /* The ending subplan would no longer be pending for a callback. */
+ Assert(!areq->callback_pending);
+ --node->as_nasyncremain;
+ return;
+ }
+
+ /* Save result so we can return it */
+ Assert(node->as_nasyncresults < node->as_nasyncplans);
+ node->as_asyncresults[node->as_nasyncresults++] = slot;
+
+ /*
+ * Mark the subplan that returned a result as ready for a new request; the
+ * new request is not launched here immediately, but in bulk by
+ * ExecAppendAsyncRequest() the next time a tuple is needed.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0b20f94035..aacd3464ce 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -391,3 +391,51 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Asynchronously request a tuple from a designated async-capable node
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Callback invoked when a relevant event has occurred
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 70f8b718e0..ea8f0ecfed 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -120,6 +120,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -241,6 +242,7 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d78b16ed1d..d8a9ec5be1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -333,6 +333,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -431,6 +432,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 0f6a77afc4..56638a0437 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1574,6 +1574,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1670,6 +1671,7 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 22d6935824..97f28227cb 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -147,6 +147,7 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
+bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f7a8dae3c6..7a29086f83 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,6 +81,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
+static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1066,6 +1067,30 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ break;
+ default:
+ break;
+ }
+ return false;
+}
+
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1083,6 +1108,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1090,6 +1116,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1153,6 +1180,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
+ /* If appropriate, consider async append */
+ consider_async = (enable_async_append && pathkeys == NIL &&
+ !best_path->path.parallel_safe &&
+ list_length(best_path->subpaths) > 1);
+
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1220,6 +1252,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (consider_async && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ ++nasyncplans;
+ }
}
/*
@@ -1254,6 +1293,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
+ plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 123369f4fa..d438e4cd17 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3919,6 +3919,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
switch (w)
{
+ case WAIT_EVENT_APPEND_READY:
+ event_name = "AppendReady";
+ break;
case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
event_name = "BackupWaitWalArchive";
break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 878fcc2236..a4d4b2027a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1108,6 +1108,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of async append plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_async_append,
+ true,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5298e18ecd..e5415772be 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -370,6 +370,7 @@
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on
+#enable_async_append = on
# - Planner Cost Constants -
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
index e69de29bb2..f7275fd154 100644
--- a/src/include/executor/execAsync.h
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(AsyncRequest *areq);
+extern void ExecAsyncConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncNotify(AsyncRequest *areq);
+extern void ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index be222ebff6..3d36096304 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -25,4 +25,6 @@ extern void ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt);
+extern void ExecAsyncAppendResponse(AsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..abd782a6f3 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,4 +31,8 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(AsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncForeignScanNotify(AsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..03cdfa12c1 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -170,6 +170,14 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+
+typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncConfigureWait_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncNotify_function) (AsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -246,6 +254,12 @@ typedef struct FdwRoutine
/* Support functions for path reparameterization. */
ReparameterizeForeignPathByChild_function ReparameterizeForeignPathByChild;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 61ba4c3666..6e2db12895 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -502,6 +502,22 @@ typedef struct ResultRelInfo
struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
+/* ----------------
+ * AsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct AsyncRequest
+{
+ struct PlanState *requestor; /* Node that wants a tuple */
+ struct PlanState *requestee; /* Node from which a tuple is wanted */
+ int request_index; /* Scratch space for requestor */
+ bool callback_pending; /* Callback is needed */
+ bool request_complete; /* Request complete, result valid */
+ TupleTableSlot *result; /* Result (NULL if no more tuples) */
+} AsyncRequest;
+
/* ----------------
* EState information
*
@@ -1207,6 +1223,16 @@ struct AppendState
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
int as_whichplan;
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_asyncplans; /* asynchronous plans indexes */
+ int as_nasyncplans; /* # of asynchronous plans */
+ AsyncRequest **as_asyncrequests; /* array of AsyncRequests */
+ TupleTableSlot **as_asyncresults; /* unreturned results of async plans */
+ int as_nasyncresults; /* # of valid entries in as_asyncresults */
+ int as_nasyncremain; /* # of remaining async plans */
+ Bitmapset *as_needrequest; /* async plans ready for a request */
+ struct WaitEventSet *as_eventset; /* WaitEventSet used to configure
+ * file descriptor wait events */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
ParallelAppendState *as_pstate; /* parallel coordination info */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7e6b10f86b..6c5396e6a3 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -129,6 +129,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -245,6 +250,7 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
+ int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 8e621d2f76..33bc133dd4 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -65,6 +65,7 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
+extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5954068dec..3249570a18 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -923,6 +923,7 @@ typedef enum
*/
typedef enum
{
+ WAIT_EVENT_APPEND_READY,
WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE = PG_WAIT_IPC,
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index dc7ab2ce8b..e78ca7bddb 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -87,6 +87,7 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -136,6 +137,7 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Plan> +
<Node-Type>Seq Scan</Node-Type> +
<Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
<Relation-Name>int8_tbl</Relation-Name> +
<Alias>i8</Alias> +
<Startup-Cost>N.N</Startup-Cost> +
@@ -183,6 +185,7 @@ select explain_filter('explain (analyze, buffers, format yaml) select * from int
- Plan: +
Node Type: "Seq Scan" +
Parallel Aware: false +
+ Async Capable: false +
Relation Name: "int8_tbl"+
Alias: "i8" +
Startup Cost: N.N +
@@ -233,6 +236,7 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -348,6 +352,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Relation Name": "tenk1", +
"Parallel Aware": true, +
"Local Hit Blocks": 0, +
@@ -393,6 +398,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Sort Space Used": 0, +
"Local Hit Blocks": 0, +
@@ -435,6 +441,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Workers Planned": 0, +
"Local Hit Blocks": 0, +
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index a8cbfd9f5f..af38f3b93c 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -558,6 +558,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 55, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
@@ -734,6 +735,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 70, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index ff157ceb1c..499245068a 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -204,6 +204,7 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "insertconflicttest", +
"Alias": "insertconflicttest", +
"Conflict Resolution": "UPDATE", +
@@ -213,7 +214,8 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
{ +
"Node Type": "Result", +
"Parent Relationship": "Member", +
- "Parallel Aware": false +
+ "Parallel Aware": false, +
+ "Async Capable": false +
} +
] +
} +
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 81bdacf59d..b7818c0637 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -88,6 +88,7 @@ select count(*) = 1 as ok from pg_stat_wal;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
@@ -106,7 +107,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
On Sun, Dec 20, 2020 at 5:15 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Thu, Nov 26, 2020 at 10:28 AM movead.li@highgo.ca
<movead.li@highgo.ca> wrote:
Issue one:
Got an Assert error at 'Assert(bms_is_member(i, node->as_needrequest));' in
the ExecAppendAsyncRequest() function when I use more than two foreign tables
on different foreign servers. I researched the code and made a change, after
which the Assert problem disappeared.
Could you show a test case causing the assertion failure?
I happened to reproduce the same failure in my environment.
I think your change would be correct, but I changed the patch so that
it doesn’t need as_lastasyncplan anymore [1]. The new version of the
patch works well for my case. So, could you test your case with it?
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CAPmGK17L0j6otssa53ZvjnCsjguJHZXaqPL2HU_LDoZ4ATZjEw@mail.gmail.com
At Sat, 19 Dec 2020 17:55:22 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Mon, Dec 14, 2020 at 4:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
The reason for the early fetching is to let the FDW send the next request as
early as possible. (However, I didn't measure the effect of the
nodeAppend-level prefetching.)
I agree that that would lead to improved efficiency in some cases,
but I still think that that would be useless in some other cases like
SELECT * FROM sharded_table LIMIT 1. Also, I think the situation
would get worse if we support Append on top of joins or aggregates
over ForeignScans, which would be more expensive to perform than these
ForeignScans.
I'm not sure which gain we weigh on, but if doing "LIMIT 1" on Append
many times is more common than fetching all or "LIMIT <many
multiples of fetch_size>", that discussion would be convincing... Is
it really the case?
I don't have a clear answer for that... Performance in the case you
mentioned would be improved by async execution without prefetching by
Append, so it seemed reasonable to me to remove that prefetching to
avoid unnecessary overheads in the case I mentioned. BUT: I started
to think my proposal, which needs an additional FDW callback routine
(ie, ForeignAsyncBegin()), might be a bad idea, because it would
increase the burden on FDW authors.
I agree on the point of developers' burden.
If we do prefetching, I think it would be better that it’s the
responsibility of the FDW to do prefetching, and I think that that
could be done by letting the FDW start another data fetch,
independently of the core, in the ForeignAsyncNotify callback routine,
The FDW does prefetching (if that means sending a request to the remote side) in my
patch, so I agree with that. I suspect that you intended to say
the opposite. The core (ExecAppendAsyncGetNext()) controls
prefetching in your patch.
No. That function just tries to retrieve a tuple from any of the
ready subplans (ie, subplans marked as as_needrequest).
Mmm. I meant that the function explicitly calls
ExecAppendAsyncRequest(), which finally calls fetch_more_data_begin()
(if needed). Conversely, if the function doesn't call
ExecAppendAsyncRequest(), the next request to the remote side doesn't
happen. That is, after the tuple buffer on the FDW side is exhausted, the
next request doesn't happen until the executor requests the next
tuple. You seem to be saying that "postgresForeignAsyncRequest() calls
fetch_more_data_begin() following its own decision", but this doesn't
seem to be "prefetching".
which I revived from Robert's original patch. I think that that would
be more efficient, because the FDW would no longer need to wait until
all buffered tuples are returned to the core. In the WIP patch, I
I don't understand. My patch sends a prefetch query as soon as all the
tuples of the last remote request are stored into FDW storage. The
reason for removing ExecAsyncNotify() was that it is just redundant as far
as Append asynchrony is concerned. But I particularly oppose reviving
the function.
Sorry, my explanation was not good, but what I'm saying here is about
my patch, not your patch. I think this FDW callback routine would be
useful; it allows an FDW to perform another asynchronous data fetch
before delivering a tuple to the core as discussed in [1]. Also, it
would be useful when extending to the case where we have intermediate
nodes between an Append and a ForeignScan such as joins or aggregates,
which I'll explain below.
Yeah. If a not-immediate parent of an async-capable node works as
async-aware, the notify API would have the power. So I don't object to
the API.
only allowed the callback routine to put the corresponding ForeignScan
node into a state where it’s either ready for a new request or needing
a callback for another data fetch, but I think we could probably relax
the restriction so that the ForeignScan node can be put into another
state where it’s ready for a new request while needing a callback for
the prefetch.
I don't understand this, either. ExecAsyncNotify() doesn't touch any of
the bitmaps, as_needrequest, callback_pending nor as_asyncpending in
your patch. Am I looking at something wrong? I'm looking at
async-wip-2020-11-17.patch.
In the WIP patch I posted, these bitmaps are modified on the core side
based on the callback_pending and request_complete flags in
AsyncRequests returned from FDWs (See ExecAppendAsyncEventWait()).
Sorry. I think I misread you here. I agree that the notify API is not
so useful now but would be useful if we allowed notifying descendants other
than immediate children. However, I stumbled on the fact that some
kinds of nodes don't return a result when all the underlying nodes
have returned *a* tuple. Concretely, count(*) doesn't return until *all*
tuples of the counted relation have been returned. I remember that this
fact might be the reason why I removed the API. After all, the topmost
async-aware node must ask every immediate child whether it can return a
tuple.
(By the way, one of the things that makes the code hard to read for me
is that "callback" here means "calling an API function". I think none of
them (ExecAsyncBegin, ExecAsyncRequest, ExecAsyncNotify) are a
"callback".)
I thought the word “callback” was OK, because these functions would
call the corresponding FDW callback routines, but I’ll revise the
wording.
I'm not confident about the usage of "callback", though:p (Sorry.) I
believe that a "callback" is a function a caller tells a callee to call.
In a broader sense, all FDW APIs are functions that an FDW
extension tells the core to call (yeah, that's inverted). However,
we don't call fread a callback of libc. They work based on slightly
different mechanisms but are substantially the same, I think.
The reason why I disabled async execution when executing EPQ is to
avoid sending asynchronous queries to the remote sides, which would be
useless, because scan tuples for an EPQ recheck are obtained in a
dedicated way.
If EPQ is performed on an Append, I think it should gain from
asynchronous execution, since it is going to fetch *a* tuple from
several partitions or children. I believe EPQ doesn't contain an Append
in major cases, though. (Or I haven't come up with the steps for such a
case to happen...)
Sorry, I don’t understand this part. Could you elaborate a bit more on it?
EPQ retrieves a specific tuple from a node. If we perform EPQ on an
Append, only one of the children should offer a result tuple. Since
Append has no idea of which of its children will offer a result, it
has no way other than asking all children until it receives a
result. If we do that, asynchronously sending a query to all nodes
would win.
What do you mean by "push-up style executor"?
The reverse of the volcano-style executor, which enters from the
topmost node and descends to the bottom. In the "push-up style executor",
the bottom-most nodes fire on a certain trigger, then every
intermediate node passes the result up to its parent until it reaches
the topmost node.
That is what I'm thinking of to be able to support the case I mentioned
above. I think that that would allow us to find ready subplans
efficiently from occurred wait events in ExecAppendAsyncEventWait().
Consider a plan like this:

Append
  -> Nested Loop
       -> Foreign Scan on a
       -> Foreign Scan on b
  -> ...

I assume here that Foreign Scan on a, Foreign Scan on b, and Nested
Loop are all async-capable and that we have somewhere in the executor
an AsyncRequest with requestor="Nested Loop" and requestee="Foreign
Scan on a", an AsyncRequest with requestor="Nested Loop" and
requestee="Foreign Scan on b", and an AsyncRequest with
requestor="Append" and requestee="Nested Loop". In
ExecAppendAsyncEventWait(), if a file descriptor for foreign table a
becomes ready, we would call ForeignAsyncNotify() for a, and if it
returns a tuple back to the requestor node (ie, Nested Loop) (using
ExecAsyncResponse()), then *ForeignAsyncNotify() would be called for
Nested Loop*. Nested Loop would then call ExecAsyncRequest() for the
inner requestee node (ie, Foreign Scan on b; I assume here that it is
a foreign scan parameterized by a). If Foreign Scan on b returns a
tuple back to the requestor node (ie, Nested Loop) (using
ExecAsyncResponse()), then Nested Loop would match the tuples from the
outer and inner sides. If they match, the join result would be
returned back to the requestor node (ie, Append) (using
ExecAsyncResponse()), marking the Nested Loop subplan as
as_needrequest. Otherwise, Nested Loop would call ExecAsyncRequest()
for the inner requestee node for the next tuple, and so on. If
ExecAsyncRequest() can't return a tuple immediately, we would wait
until a file descriptor for foreign table b becomes ready; we would
start from calling ForeignAsyncNotify() for b when the file descriptor
becomes ready. In this way we could find ready subplans efficiently
from occurred wait events in ExecAppendAsyncEventWait() when extending
to the case where subplans are joins or aggregates over Foreign Scans,
I think. Maybe I’m missing something, though.
Maybe so. As I mentioned above, in the following case...

Join-1
  Join-2
    ForeignScan-A
    ForeignScan-B
  ForeignScan-C
Where Join-1 is the leader of asynchronous fetching. Even if both
of FS-A and FS-B have returned one tuple each, it's unsure whether Join-2
returns a tuple. I'm not sure how to resolve that situation with the
current infrastructure as-is.
So I tried a structure where when a node gets a new tuple, the node
asks its parent whether it is satisfied or not. In that trial I needed
to make every execnode a state machine, and that was pretty messy.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Jan 15, 2021 at 4:54 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
Mmm. I meant that the function explicitly calls
ExecAppendAsyncRequest(), which finally calls fetch_more_data_begin()
(if needed). Conversely, if the function doesn't call
ExecAppendAsyncRequest(), the next request to the remote side doesn't
happen. That is, after the tuple buffer on the FDW side is exhausted, the
next request doesn't happen until the executor requests the next
tuple. You seem to be saying that "postgresForeignAsyncRequest() calls
fetch_more_data_begin() following its own decision", but this doesn't
seem to be "prefetching".
Let me explain a bit more. Actually, the new version of the patch
allows prefetching on the FDW side; for such prefetching in
postgres_fdw, I think we could add a fetch_more_data_begin() call in
postgresForeignAsyncNotify(). But I left that for future work,
because we don’t know yet if that’s really useful. (Another reason
why I left that is we have more important issues that should be
addressed [1], and I think addressing those issues is a requirement
for us to commit this patch, but adding such prefetching isn’t, IMO.)
Sorry. I think I misread you here. I agree that the notify API is not
so useful now but would be useful if we allowed notifying descendants other
than immediate children. However, I stumbled on the fact that some
kinds of nodes don't return a result when all the underlying nodes
have returned *a* tuple. Concretely, count(*) doesn't return until *all*
tuples of the counted relation have been returned. I remember that this
fact might be the reason why I removed the API. After all, the topmost
async-aware node must ask every immediate child whether it can return a
tuple.
The patch I posted, which revived Robert’s original patch using stuff
from your patch and Thomas’, provides ExecAsyncRequest() as well as
ExecAsyncNotify(), which supports pull-based execution like
ExecProcNode() (while ExecAsyncNotify() supports push-based
execution.) In the aggregate case you mentioned, I think we could
iterate calling ExecAsyncRequest() for the underlying subplan to get
all tuples from it, in a similar way to ExecProcNode() in the normal
case.
EPQ retrieves a specific tuple from a node. If we perform EPQ on an
Append, only one of the children should offer a result tuple. Since
Append has no idea of which of its children will offer a result, it
has no way other than asking all children until it receives a
result. If we do that, asynchronously sending a query to all nodes
would win.
Thanks for the explanation! But I’m still not sure why we need to
send an asynchronous query to each of the asynchronous nodes in an EPQ
recheck. Is it possible to explain a bit more about that?
I wrote:
That is what I'm thinking of to be able to support the case I mentioned
above. I think that that would allow us to find ready subplans
efficiently from occurred wait events in ExecAppendAsyncEventWait().
Consider a plan like this:

Append
  -> Nested Loop
       -> Foreign Scan on a
       -> Foreign Scan on b
  -> ...

I assume here that Foreign Scan on a, Foreign Scan on b, and Nested
Loop are all async-capable and that we have somewhere in the executor
an AsyncRequest with requestor="Nested Loop" and requestee="Foreign
Scan on a", an AsyncRequest with requestor="Nested Loop" and
requestee="Foreign Scan on b", and an AsyncRequest with
requestor="Append" and requestee="Nested Loop". In
ExecAppendAsyncEventWait(), if a file descriptor for foreign table a
becomes ready, we would call ForeignAsyncNotify() for a, and if it
returns a tuple back to the requestor node (ie, Nested Loop) (using
ExecAsyncResponse()), then *ForeignAsyncNotify() would be called for
Nested Loop*. Nested Loop would then call ExecAsyncRequest() for the
inner requestee node (ie, Foreign Scan on b; I assume here that it is
a foreign scan parameterized by a). If Foreign Scan on b returns a
tuple back to the requestor node (ie, Nested Loop) (using
ExecAsyncResponse()), then Nested Loop would match the tuples from the
outer and inner sides. If they match, the join result would be
returned back to the requestor node (ie, Append) (using
ExecAsyncResponse()), marking the Nested Loop subplan as
as_needrequest. Otherwise, Nested Loop would call ExecAsyncRequest()
for the inner requestee node for the next tuple, and so on. If
ExecAsyncRequest() can't return a tuple immediately, we would wait
until a file descriptor for foreign table b becomes ready; we would
start from calling ForeignAsyncNotify() for b when the file descriptor
becomes ready. In this way we could find ready subplans efficiently
from occurred wait events in ExecAppendAsyncEventWait() when extending
to the case where subplans are joins or aggregates over Foreign Scans,
I think. Maybe I’m missing something, though.
Maybe so. As I mentioned above, in the following case...

Join-1
  Join-2
    ForeignScan-A
    ForeignScan-B
  ForeignScan-C

Where Join-1 is the leader of asynchronous fetching. Even if both
of FS-A and FS-B have returned one tuple each, it's unsure whether Join-2
returns a tuple. I'm not sure how to resolve that situation with the
current infrastructure as-is.
Maybe my explanation was not good, so let me explain a bit more.
Assume that Join-2 is a nested loop join as shown above. If the
tuples from the outer/inner sides didn’t match, we could iterate
calling *ExecAsyncRequest()* for the inner side until a matched tuple
from it is found. If the inner side wasn’t able to return a tuple
immediately, 1) it would return request_complete=false to Join-2 using
ExecAsyncResponse(), and 2) we could wait for a file descriptor for
the inner side to become ready (while processing other parts of the
Append tree), and 3) when the file descriptor becomes ready, recursive
ExecAsyncNotify() calls would restart the Join-2 processing in a
push-based manner as explained above.
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CAPmGK14xrGe+Xks7+fVLBoUUbKwcDkT9km1oFXhdY+FFhbMjUg@mail.gmail.com
On Tue, Nov 17, 2020 at 6:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
* I haven't yet done anything about the issue on postgres_fdw's
handling of concurrent data fetches by multiple ForeignScan nodes
(below *different* Append nodes in the query) using the same
connection discussed in [2]. I modified the patch to just disable
applying this feature to problematic test cases in the postgres_fdw
regression tests, using a new GUC, enable_async_append.
A solution for the issue would be a scheduler designed to handle such
data fetches more efficiently, but I don’t think it’s easy to create
such a scheduler. Rather than doing so, I'd like to propose allowing
FDWs to disable async execution by themselves in problematic cases
during executor startup as a first cut. What I have in
mind for that is:
1) For an FDW that has async-capable ForeignScan(s), we allow the FDW
to record, for each of the async-capable and non-async-capable
ForeignScan(s), the information on a connection to be used for the
ForeignScan into EState during BeginForeignScan().
2) After doing ExecInitNode() for each SubPlan and the main query tree
in InitPlan(), we give the FDW a chance to a) reconsider, for each of
the async-capable ForeignScan(s), whether the ForeignScan can be
executed asynchronously as planned, based on the information stored
into EState in #1, and then b) disable async execution of the
ForeignScan if not.
#1 and #2 would be done after initial partition pruning, so more
async-capable ForeignScans would be executed asynchronously, if other
async-capable ForeignScans conflicting with them are removed by that
pruning.
This wouldn’t prevent us from adding a feature like what was proposed
by Horiguchi-san later.
BTW: while considering this, I noticed some bugs with
ExecAppendAsyncBegin() in the previous patch. Attached is a new
version of the patch fixing them. I also tweaked some comments a
little bit.
Best regards,
Etsuro Fujita
Attachments:
async-wip-2021-02-01.patchapplication/octet-stream; name=async-wip-2021-02-01.patchDownload
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ee0b4acf0b..3ecb8e1e4f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -62,6 +62,7 @@ typedef struct ConnCacheEntry
Oid serverid; /* foreign server OID used to get server name */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ PgFdwConnState state; /* extra per-connection state */
} ConnCacheEntry;
/*
@@ -117,7 +118,7 @@ static bool disconnect_cached_connections(Oid serverid);
* (not even on error), we need this flag to cue manual cleanup.
*/
PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
{
bool found;
bool retry = false;
@@ -264,6 +265,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Remember if caller will prepare statements */
entry->have_prep_stmt |= will_prep_stmt;
+ /* If caller needs access to the per-connection state, return it. */
+ if (state)
+ *state = &entry->state;
+
return entry->conn;
}
@@ -291,6 +296,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ memset(&entry->state, 0, sizeof(entry->state));
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index b09dce63f5..9dc0549a07 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -7003,7 +7003,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7031,7 +7031,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7059,7 +7059,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7087,7 +7087,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7115,7 +7115,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+----
(0 rows)
@@ -7157,35 +7157,40 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7195,35 +7200,40 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7255,7 +7265,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-> Hash Join
@@ -7273,7 +7283,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
(39 rows)
@@ -7291,6 +7301,7 @@ select tableoid::regclass, * from bar order by 1,2;
(6 rows)
-- Check UPDATE with inherited target and an appendrel subquery
+SET enable_async_append TO false;
explain (verbose, costs off)
update bar set f2 = f2 + 100
from
@@ -7349,6 +7360,7 @@ update bar set f2 = f2 + 100
from
( select f1 from foo union all select f1+3 from foo ) ss
where bar.f1 = ss.f1;
+RESET enable_async_append;
select tableoid::regclass, * from bar order by 1,2;
tableoid | f1 | f2
----------+----+-----
@@ -8588,9 +8600,9 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
(7 rows)
@@ -8627,19 +8639,19 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
(11 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
@@ -8669,9 +8681,9 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
(7 rows)
@@ -8724,6 +8736,7 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
(14 rows)
-- test FOR UPDATE; partitionwise join does not apply
+SET enable_async_append TO false;
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
QUERY PLAN
@@ -8751,6 +8764,7 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
400 | 400
(4 rows)
+RESET enable_async_append;
RESET enable_partitionwise_join;
-- ===================================================================
-- test partitionwise aggregates
@@ -8775,17 +8789,17 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
(9 rows)
-- Plan with partitionwise aggregates is enabled
@@ -8797,11 +8811,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
(9 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2ce42ce3f1..aff0f81426 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -37,6 +38,7 @@
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
#include "postgres_fdw.h"
+#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/float.h"
#include "utils/guc.h"
@@ -159,6 +161,11 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ /* for asynchronous execution */
+ bool async_capable; /* engage asynchronous-capable logic? */
+ PgFdwConnState *conn_state; /* extra per-connection state */
+ ForeignScanState *next_node; /* next ForeignScan node to activate */
+
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -408,6 +415,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(AsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(AsyncRequest *areq);
+static void postgresForeignAsyncNotify(AsyncRequest *areq);
/*
* Helper functions
@@ -436,6 +447,7 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
+static void fetch_more_data_begin(ForeignScanState *node);
static void fetch_more_data(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
@@ -491,6 +503,7 @@ static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
double *totaldeadrows);
static void analyze_row_processor(PGresult *res, int row,
PgFdwAnalyzeState *astate);
+static void request_tuple_asynchronously(AsyncRequest *areq);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -583,6 +596,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for asynchronous execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1458,7 +1477,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
/* Assign a unique ID for my cursor */
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1509,6 +1528,12 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
+
+ /* Initialize async state */
+ fsstate->async_capable = node->ss.ps.plan->async_capable;
+ fsstate->conn_state->activated = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
}
/*
@@ -1523,8 +1548,10 @@ postgresIterateForeignScan(ForeignScanState *node)
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
+ * In sync mode, if this is the first call after Begin or ReScan, we need
+ * to create the cursor on the remote side. In async mode, we would have
+ * already created the cursor before we get here, even if this is the first
+ * call after Begin or ReScan.
*/
if (!fsstate->cursor_exists)
create_cursor(node);
@@ -1534,6 +1561,9 @@ postgresIterateForeignScan(ForeignScanState *node)
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
+ /* In async mode, just clear tuple slot. */
+ if (fsstate->async_capable)
+ return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
fetch_more_data(node);
@@ -1563,6 +1593,14 @@ postgresReScanForeignScan(ForeignScanState *node)
char sql[64];
PGresult *res;
+ /* Reset async state */
+ if (fsstate->async_capable)
+ {
+ fsstate->conn_state->activated = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
+ }
+
/* If we haven't created the cursor yet, nothing to do. */
if (!fsstate->cursor_exists)
return;
@@ -1621,6 +1659,14 @@ postgresEndForeignScan(ForeignScanState *node)
if (fsstate == NULL)
return;
+ /*
+ * If we're ending before we've collected a response from an asynchronous
+ * query, we have to consume the response.
+ */
+ if (fsstate->conn_state->activated == node &&
+ fsstate->conn_state->async_query_sent)
+ fetch_more_data(node);
+
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
close_cursor(fsstate->conn, fsstate->cursor_number);
@@ -2481,7 +2527,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->conn = GetConnection(user, false, NULL);
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2862,7 +2908,7 @@ estimate_path_cost_size(PlannerInfo *root,
false, &retrieved_attrs, NULL);
/* Get the remote estimate */
- conn = GetConnection(fpinfo->user, false);
+ conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3490,6 +3536,34 @@ create_cursor(ForeignScanState *node)
pfree(buf.data);
}
+/*
+ * Begin an asynchronous data fetch.
+ * fetch_more_data must be called to fetch the results.
+ */
+static void
+fetch_more_data_begin(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PGconn *conn = fsstate->conn;
+ char sql[64];
+
+ Assert(fsstate->conn_state->activated == node);
+ Assert(!fsstate->conn_state->async_query_sent);
+
+ /* Create the cursor synchronously. */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ /* We will send this query, but not wait for the response. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, fsstate->query);
+
+ fsstate->conn_state->async_query_sent = true;
+}
+
/*
* Fetch some more rows from the node's cursor.
*/
@@ -3512,17 +3586,36 @@ fetch_more_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->conn;
- char sql[64];
int numrows;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
+ if (fsstate->async_capable)
+ {
+ Assert(fsstate->conn_state->activated == node);
+ Assert(fsstate->conn_state->async_query_sent);
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
- pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ /*
+ * The query was already sent by an earlier call to
+ * fetch_more_data_begin. So now we just fetch the result.
+ */
+ res = PQgetResult(conn);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
+ else
+ {
+ char sql[64];
+
+ /* This is a regular synchronous fetch. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ res = pgfdw_exec_query(conn, sql);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
/* Convert the data into HeapTuples */
numrows = PQntuples(res);
@@ -3549,6 +3642,15 @@ fetch_more_data(ForeignScanState *node)
/* Must be EOF if we didn't get as many tuples as we asked for. */
fsstate->eof_reached = (numrows < fsstate->fetch_size);
+
+ /* If this was the second part of an async request, we must fetch until NULL. */
+ if (fsstate->async_capable)
+ {
+ /* XXX: call PQgetResult just once and raise an error if non-NULL? */
+ while (PQgetResult(conn) != NULL)
+ ;
+ fsstate->conn_state->async_query_sent = false;
+ }
}
PG_FINALLY();
{
@@ -3674,7 +3776,7 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->conn = GetConnection(user, true, NULL);
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -4608,7 +4710,7 @@ postgresAnalyzeForeignTable(Relation relation,
*/
table = GetForeignTable(RelationGetRelid(relation));
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct command to get page count for relation.
@@ -4694,7 +4796,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
table = GetForeignTable(RelationGetRelid(relation));
server = GetForeignServer(table->serverid);
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct cursor that retrieves whole rows from remote.
@@ -4922,7 +5024,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
*/
server = GetForeignServer(serverOid);
mapping = GetUserMapping(GetUserId(), server->serverid);
- conn = GetConnection(mapping, false);
+ conn = GetConnection(mapping, false, NULL);
/* Don't attempt to import collation if remote server hasn't got it */
if (PQserverVersion(conn) < 90100)
@@ -6469,6 +6571,177 @@ add_foreign_final_paths(PlannerInfo *root, RelOptInfo *input_rel,
add_path(final_rel, (Path *) final_path);
}
+/*
+ * postgresIsForeignPathAsyncCapable
+ * Check whether a given ForeignPath node is async-capable.
+ */
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * postgresForeignAsyncRequest
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+
+ /*
+ * If this is the first call after Begin or ReScan, mark the connection
+ * as used by the ForeignScan node.
+ */
+ if (fsstate->conn_state->activated == NULL)
+ fsstate->conn_state->activated = node;
+
+ /*
+ * If the connection has already been used by a ForeignScan node, put it
+ * at the end of the chain of waiting ForeignScan nodes, and then return.
+ */
+ if (node != fsstate->conn_state->activated)
+ {
+ ForeignScanState *curr_node = fsstate->conn_state->activated;
+ PgFdwScanState *curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+
+ /* Scan down the chain ... */
+ while (curr_fsstate->next_node)
+ {
+ curr_node = curr_fsstate->next_node;
+ Assert(node != curr_node);
+ curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+ }
+ /* Update the chain linking */
+ curr_fsstate->next_node = node;
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+
+ request_tuple_asynchronously(areq);
+}
+
+/*
+ * postgresForeignAsyncConfigureWait
+ * Configure a file descriptor event for which we wish to wait.
+ */
+static void
+postgresForeignAsyncConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AppendState *requestor = (AppendState *) areq->requestor;
+ WaitEventSet *set = requestor->as_eventset;
+
+ /* This function should not be called unless callback_pending */
+ Assert(areq->callback_pending);
+
+ /* If the ForeignScan node isn't activated yet, nothing to do */
+ if (fsstate->conn_state->activated != node)
+ return;
+
+ AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
+ NULL, areq);
+}
+
+/*
+ * postgresForeignAsyncNotify
+ * Fetch some more tuples from a file descriptor that has become ready,
+ * requesting the next tuple.
+ */
+static void
+postgresForeignAsyncNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+
+ /* The core code would have initialized the callback_pending flag */
+ Assert(!areq->callback_pending);
+
+ fetch_more_data(node);
+
+ request_tuple_asynchronously(areq);
+}
+
+/*
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+request_tuple_asynchronously(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ TupleTableSlot *result;
+
+ /* Request some more tuples, if we've run out */
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ /* No point in another fetch if we already detected EOF, though */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+ fsstate->conn_state->activated = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->activated = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin an asynchronous fetch for that node */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ return;
+ }
+
+ /* Get a tuple from the ForeignScan node */
+ result = ExecProcNode((PlanState *) node);
+
+ if (TupIsNull(result))
+ {
+ Assert(fsstate->next_tuple >= fsstate->num_tuples);
+
+ /* Request some more tuples, if we've not detected EOF yet */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+ fsstate->conn_state->activated = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->activated = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin an asynchronous fetch for that node */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+ }
+
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 1f67b4d9fd..c3537b6449 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,6 +16,7 @@
#include "foreign/foreign.h"
#include "lib/stringinfo.h"
#include "libpq-fe.h"
+#include "nodes/execnodes.h"
#include "nodes/pathnodes.h"
#include "utils/relcache.h"
@@ -124,12 +125,22 @@ typedef struct PgFdwRelationInfo
int relation_index;
} PgFdwRelationInfo;
+/*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+ ForeignScanState *activated; /* currently-activated ForeignScan node */
+ bool async_query_sent; /* has an asynchronous query been sent? */
+} PgFdwConnState;
+
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+ PgFdwConnState **state);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 319c15d635..b6911b4443 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1810,31 +1810,31 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1870,12 +1870,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1885,6 +1885,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
select tableoid::regclass, * from bar order by 1,2;
-- Check UPDATE with inherited target and an appendrel subquery
+SET enable_async_append TO false;
explain (verbose, costs off)
update bar set f2 = f2 + 100
from
@@ -1894,6 +1895,7 @@ update bar set f2 = f2 + 100
from
( select f1 from foo union all select f1+3 from foo ) ss
where bar.f1 = ss.f1;
+RESET enable_async_append;
select tableoid::regclass, * from bar order by 1,2;
@@ -2503,9 +2505,11 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE a % 25 = 0) t1 FULL JOIN (SELECT 't2_phv' phv, * FROM fprt2 WHERE b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY t1.a, t2.b;
-- test FOR UPDATE; partitionwise join does not apply
+SET enable_async_append TO false;
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
+RESET enable_async_append;
RESET enable_partitionwise_join;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e17cdcc816..c60c7ef66f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4735,6 +4735,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<variablelist>
+ <varlistentry id="guc-enable-async-append" xreflabel="enable_async_append">
+ <term><varname>enable_async_append</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_async_append</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of async-aware
+ append plan types. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
<term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c602ee4427..a2d2f42e28 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1563,6 +1563,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AppendReady</literal></entry>
+ <entry>Waiting for a subplan of Append to be ready.</entry>
+ </row>
<row>
<entry><literal>BackupWaitWalArchive</literal></entry>
<entry>Waiting for WAL files required for a backup to be successfully
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5d7eb3574c..83c24a5c08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1390,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1409,6 +1411,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (custom_name)
ExplainPropertyText("Custom Plan Provider", custom_name, es);
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
+ ExplainPropertyBool("Async Capable", plan->async_capable, es);
}
switch (nodeTag(plan))
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 23bdb53cd1..613835b748 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -526,6 +526,10 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e69de29bb2..6174ea1eb6 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,113 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+
+static void ExecAsyncResponse(AsyncRequest *areq);
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ */
+void
+ExecAsyncRequest(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor event
+ * for which it wishes to wait. We expect the node-type specific callback to
+ * make a single call of the following form:
+ *
+ * AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, areq);
+ */
+void
+ExecAsyncConfigureWait(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+void
+ExecAsyncNotify(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * A requestee node should call this function to deliver the tuple to its
+ * requestor node. The node can call this from its ExecAsyncRequest callback
+ * if the requested tuple is available immediately.
+ */
+void
+ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result)
+{
+ areq->request_complete = true;
+ areq->result = result;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 15e4115bd6..123d5163de 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -57,10 +57,13 @@
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
/* Shared state for parallel-aware Append. */
struct ParallelAppendState
@@ -78,12 +81,18 @@ struct ParallelAppendState
};
#define INVALID_SUBPLAN_INDEX -1
+#define EVENT_BUFFER_SIZE 16
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
static void mark_invalid_subplans_as_finished(AppendState *node);
+static void ExecAppendAsyncBegin(AppendState *node);
+static bool ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result);
+static void ExecAppendAsyncEventWait(AppendState *node);
+static void classify_matching_subplans(AppendState *node);
/* ----------------------------------------------------------------
* ExecInitAppend
@@ -102,7 +111,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
AppendState *appendstate = makeNode(AppendState);
PlanState **appendplanstates;
Bitmapset *validsubplans;
+ Bitmapset *asyncplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
@@ -119,6 +130,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_syncdone = false;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -191,12 +203,24 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* While at it, find out the first valid partial plan.
*/
j = 0;
+ asyncplans = NULL;
+ nasyncplans = 0;
firstvalid = nplans;
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record async subplans. When executing EvalPlanQual, we process
+ * async subplans synchronously, so don't do this in that case.
+ */
+ if (initNode->async_capable && estate->es_epq_active == NULL)
+ {
+ asyncplans = bms_add_member(asyncplans, j);
+ nasyncplans++;
+ }
+
/*
* Record the lowest appendplans index which is a valid partial plan.
*/
@@ -210,6 +234,39 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* Initialize async state */
+ appendstate->as_asyncplans = asyncplans;
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_asyncrequests = NULL;
+ appendstate->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ appendstate->as_needrequest = NULL;
+ appendstate->as_eventset = NULL;
+
+ if (nasyncplans > 0)
+ {
+ appendstate->as_asyncrequests = (AsyncRequest **)
+ palloc0(nplans * sizeof(AsyncRequest *));
+
+ i = -1;
+ while ((i = bms_next_member(asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq;
+
+ areq = palloc(sizeof(AsyncRequest));
+ areq->requestor = (PlanState *) appendstate;
+ areq->requestee = appendplanstates[i];
+ areq->request_index = i;
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+
+ appendstate->as_asyncrequests[i] = areq;
+ }
+
+ classify_matching_subplans(appendstate);
+ }
+
/*
* Miscellaneous initialization
*/
@@ -232,31 +289,45 @@ static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
+ TupleTableSlot *result;
- if (node->as_whichplan < 0)
+ if (!node->as_syncdone && node->as_whichplan == INVALID_SUBPLAN_INDEX)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ /* If there are any async subplans, begin execution of them */
+ if (node->as_nasyncplans > 0)
+ ExecAppendAsyncBegin(node);
+
/*
- * If no subplan has been chosen, we must choose one before
+ * If no sync subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
- !node->choose_next_subplan(node))
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
{
PlanState *subnode;
- TupleTableSlot *result;
CHECK_FOR_INTERRUPTS();
/*
- * figure out which subplan we are currently processing
+ * try to get a tuple from any of the async subplans
+ */
+ if (!bms_is_empty(node->as_needrequest) ||
+ (node->as_syncdone && node->as_nasyncremain > 0))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
+ Assert(bms_is_empty(node->as_needrequest));
+ }
+
+ /*
+ * figure out which sync subplan we are currently processing
*/
Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
subnode = node->appendplans[node->as_whichplan];
@@ -276,8 +347,16 @@ ExecAppend(PlanState *pstate)
return result;
}
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /* wait or poll async events */
+ if (node->as_nasyncremain > 0)
+ {
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ ExecAppendAsyncEventWait(node);
+ }
+
+ /* choose new sync subplan; if no sync/async subplans, we're done */
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -313,6 +392,7 @@ ExecEndAppend(AppendState *node)
void
ExecReScanAppend(AppendState *node)
{
+ int nasyncplans = node->as_nasyncplans;
int i;
/*
@@ -326,6 +406,11 @@ ExecReScanAppend(AppendState *node)
{
bms_free(node->as_valid_subplans);
node->as_valid_subplans = NULL;
+ if (nasyncplans > 0)
+ {
+ bms_free(node->as_valid_asyncplans);
+ node->as_valid_asyncplans = NULL;
+ }
}
for (i = 0; i < node->as_nplans; i++)
@@ -347,8 +432,26 @@ ExecReScanAppend(AppendState *node)
ExecReScan(subnode);
}
+ /* Reset async state */
+ if (nasyncplans > 0)
+ {
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+ }
+
+ bms_free(node->as_needrequest);
+ node->as_needrequest = NULL;
+ }
+
/* Let choose_next_subplan_* function handle setting the first subplan */
node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_syncdone = false;
}
/* ----------------------------------------------------------------
@@ -429,7 +532,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
/* ----------------------------------------------------------------
* choose_next_subplan_locally
*
- * Choose next subplan for a non-parallel-aware Append,
+ * Choose next sync subplan for a non-parallel-aware Append,
* returning false if there are no more.
* ----------------------------------------------------------------
*/
@@ -444,9 +547,9 @@ choose_next_subplan_locally(AppendState *node)
/*
* If first call then have the bms member function choose the first valid
- * subplan by initializing whichplan to -1. If there happen to be no
- * valid subplans then the bms member function will handle that by
- * returning a negative number which will allow us to exit returning a
+ * sync subplan by initializing whichplan to -1. If there happen to be
+ * no valid sync subplans then the bms member function will handle that
+ * by returning a negative number which will allow us to exit returning a
* false value.
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
@@ -467,7 +570,10 @@ choose_next_subplan_locally(AppendState *node)
nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
if (nextplan < 0)
+ {
+ node->as_syncdone = true;
return false;
+ }
node->as_whichplan = nextplan;
@@ -709,3 +815,298 @@ mark_invalid_subplans_as_finished(AppendState *node)
node->as_pstate->pa_finished[i] = true;
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncBegin
+ *
+ * Begin execution of designated async-capable subplans.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncBegin(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+ int i;
+
+ /* We should never be called when there are no async subplans. */
+ Assert(node->as_nasyncplans > 0);
+
+ if (node->as_valid_subplans == NULL)
+ {
+ Assert(node->as_valid_asyncplans == NULL);
+
+ node->as_valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ classify_matching_subplans(node);
+ }
+
+ node->as_nasyncremain = 0;
+
+ /* Nothing to do if there are no valid async subplans. */
+ valid_asyncplans = node->as_valid_asyncplans;
+ if (valid_asyncplans == NULL)
+ return;
+
+ /* Make a request for each of the async subplans. */
+ i = -1;
+ while ((i = bms_next_member(valid_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(areq->request_index == i);
+ Assert(!areq->callback_pending);
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+
+ ++node->as_nasyncremain;
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncGetNext
+ *
+ * Get the next tuple from any of the asynchronous subplans.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
+{
+ *result = NULL;
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ while (node->as_nasyncremain > 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Wait or poll async events. */
+ ExecAppendAsyncEventWait(node);
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ /* Break from the loop if there is any sync subplan that is not complete */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If all sync subplans are complete, we're totally done scanning the
+ * given node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the sync subplans.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncremain == 0);
+ *result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncRequest
+ *
+ * If there are any asynchronous subplans that need a new asynchronous
+ * request, make all of them.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
+{
+ Bitmapset *needrequest;
+ int i;
+
+ /* Nothing to do if there are no async subplans needing a new request. */
+ if (bms_is_empty(node->as_needrequest))
+ return false;
+
+ /*
+ * If there are any asynchronously-generated results that have not yet
+ * been returned, we have nothing to do; just return one of them.
+ */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ /* Make a new request for each of the async subplans that need it. */
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ i = -1;
+ while ((i = bms_next_member(needrequest, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(needrequest);
+
+ /* Return one of the asynchronously-generated results if any. */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncEventWait
+ *
+ * Wait or poll for file descriptor wait events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncEventWait(AppendState *node)
+{
+ long timeout = node->as_syncdone ? -1 : 0;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+
+ /* Nothing to do if there are no remaining async subplans. */
+ if (node->as_nasyncremain == 0)
+ return;
+
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
+ node->as_nasyncplans + 1);
+ AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+ NULL, NULL);
+
+ /* Give each waiting subplan a chance to add an event. */
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ if (areq->callback_pending)
+ ExecAsyncConfigureWait(areq);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+ EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ FreeWaitEventSet(node->as_eventset);
+ node->as_eventset = NULL;
+ if (noccurred == 0)
+ return;
+
+ /* Deliver notifications. */
+ for (i = 0; i < noccurred; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ /*
+ * Each waiting subplan should have registered its wait event with
+ * user_data pointing back to its AsyncRequest.
+ */
+ if ((w->events & WL_SOCKET_READABLE) != 0)
+ {
+ AsyncRequest *areq = (AsyncRequest *) w->user_data;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ Assert(areq->callback_pending);
+ areq->callback_pending = false;
+
+ /* Do the actual work. */
+ ExecAsyncNotify(areq);
+ }
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(AsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot = areq->result;
+
+ /* The result should be a TupleTableSlot or NULL. */
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Nothing to do if the request is pending. */
+ if (!areq->request_complete)
+ {
+ /*
+ * The subplan for which the request was made should be pending for a
+ * callback.
+ */
+ Assert(areq->callback_pending);
+ return;
+ }
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ {
+ /* The ending subplan wouldn't have been pending for a callback. */
+ Assert(!areq->callback_pending);
+ --node->as_nasyncremain;
+ return;
+ }
+
+ /* Save result so we can return it */
+ Assert(node->as_nasyncresults < node->as_nasyncplans);
+ node->as_asyncresults[node->as_nasyncresults++] = slot;
+
+ /*
+ * Mark the subplan that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
+
+/* ----------------------------------------------------------------
+ * classify_matching_subplans
+ *
+ * Classify the node's as_valid_subplans into sync ones and
+ * async ones, adjust it to contain sync ones only, and save
+ * async ones in the node's as_valid_asyncplans
+ * ----------------------------------------------------------------
+ */
+static void
+classify_matching_subplans(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+
+ /* Nothing to do if there are no valid subplans. */
+ if (bms_is_empty(node->as_valid_subplans))
+ return;
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (!bms_overlap(node->as_valid_subplans, node->as_asyncplans))
+ return;
+
+ /* Get valid async subplans. */
+ valid_asyncplans = bms_copy(node->as_asyncplans);
+ valid_asyncplans = bms_int_members(valid_asyncplans,
+ node->as_valid_subplans);
+
+ /* Adjust the valid subplans to contain sync subplans only. */
+ node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
+ valid_asyncplans);
+
+ /* Save valid async subplans. */
+ node->as_valid_asyncplans = valid_asyncplans;
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0969e53c3a..898890fb08 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -391,3 +391,51 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Asynchronously request a tuple from a designated async-capable node
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Callback invoked when a relevant event has occurred
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 21e09c667a..6b0c5f0286 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -120,6 +120,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -241,6 +242,7 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8392be6d44..3270d79a3e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -333,6 +333,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -431,6 +432,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d2c8d58070..7c271fefea 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1574,6 +1574,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1670,6 +1671,7 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index aab06c7d21..3b034a0326 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -147,6 +147,7 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
+bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 25d4750ca6..1f13bf5d55 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,6 +81,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
+static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1066,6 +1067,30 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
+
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1083,6 +1108,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1090,6 +1116,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1153,6 +1180,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
+ /* If appropriate, consider async append */
+ consider_async = (enable_async_append && pathkeys == NIL &&
+ !best_path->path.parallel_safe &&
+ list_length(best_path->subpaths) > 1);
+
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1220,6 +1252,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (consider_async && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ ++nasyncplans;
+ }
}
/*
@@ -1254,6 +1293,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
+ plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..58f8e0bbcf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3999,6 +3999,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
switch (w)
{
+ case WAIT_EVENT_APPEND_READY:
+ event_name = "AppendReady";
+ break;
case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
event_name = "BackupWaitWalArchive";
break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index eafdb1118e..507567aff3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1111,6 +1111,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of async append plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_async_append,
+ true,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index bd57e917e1..1306094865 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -370,6 +370,7 @@
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on
+#enable_async_append = on
# - Planner Cost Constants -
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
index e69de29bb2..f7275fd154 100644
--- a/src/include/executor/execAsync.h
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(AsyncRequest *areq);
+extern void ExecAsyncConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncNotify(AsyncRequest *areq);
+extern void ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index cafd410a5d..fa54ac6ad2 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -25,4 +25,6 @@ extern void ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt);
+extern void ExecAsyncAppendResponse(AsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 6ae7733e25..8ffc0ca5bf 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,4 +31,8 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(AsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncForeignScanNotify(AsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 248f78da45..7c89d081c7 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -178,6 +178,14 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+
+typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncConfigureWait_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncNotify_function) (AsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -256,6 +264,12 @@ typedef struct FdwRoutine
/* Support functions for path reparameterization. */
ReparameterizeForeignPathByChild_function ReparameterizeForeignPathByChild;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d65099c94a..d754b59088 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -508,6 +508,22 @@ typedef struct ResultRelInfo
struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
+/* ----------------
+ * AsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct AsyncRequest
+{
+ struct PlanState *requestor; /* Node that wants a tuple */
+ struct PlanState *requestee; /* Node from which a tuple is wanted */
+ int request_index; /* Scratch space for requestor */
+ bool callback_pending; /* Callback is needed */
+ bool request_complete; /* Request complete, result valid */
+ TupleTableSlot *result; /* Result (NULL if no more tuples) */
+} AsyncRequest;
+
/* ----------------
* EState information
*
@@ -1213,12 +1229,23 @@ struct AppendState
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
int as_whichplan;
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_asyncplans; /* asynchronous plans indexes */
+ int as_nasyncplans; /* # of asynchronous plans */
+ AsyncRequest **as_asyncrequests; /* array of AsyncRequests */
+ TupleTableSlot **as_asyncresults; /* unreturned results of async plans */
+ int as_nasyncresults; /* # of valid entries in as_asyncresults */
+ int as_nasyncremain; /* # of remaining async plans */
+ Bitmapset *as_needrequest; /* async plans ready for a request */
+ struct WaitEventSet *as_eventset; /* WaitEventSet used to configure
+ * file descriptor wait events */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_asyncplans;
bool (*choose_next_subplan) (AppendState *);
};
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 43160439f0..ebc94ffaf9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -129,6 +129,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -245,6 +250,7 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
+ int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ed2e4af4be..c2952e375d 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -65,6 +65,7 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
+extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..d9588da38a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -957,6 +957,7 @@ typedef enum
*/
typedef enum
{
+ WAIT_EVENT_APPEND_READY,
WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE = PG_WAIT_IPC,
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index dc7ab2ce8b..e78ca7bddb 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -87,6 +87,7 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -136,6 +137,7 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Plan> +
<Node-Type>Seq Scan</Node-Type> +
<Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
<Relation-Name>int8_tbl</Relation-Name> +
<Alias>i8</Alias> +
<Startup-Cost>N.N</Startup-Cost> +
@@ -183,6 +185,7 @@ select explain_filter('explain (analyze, buffers, format yaml) select * from int
- Plan: +
Node Type: "Seq Scan" +
Parallel Aware: false +
+ Async Capable: false +
Relation Name: "int8_tbl"+
Alias: "i8" +
Startup Cost: N.N +
@@ -233,6 +236,7 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -348,6 +352,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Relation Name": "tenk1", +
"Parallel Aware": true, +
"Local Hit Blocks": 0, +
@@ -393,6 +398,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Sort Space Used": 0, +
"Local Hit Blocks": 0, +
@@ -435,6 +441,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Workers Planned": 0, +
"Local Hit Blocks": 0, +
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index a8cbfd9f5f..af38f3b93c 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -558,6 +558,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 55, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
@@ -734,6 +735,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 70, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index ff157ceb1c..499245068a 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -204,6 +204,7 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "insertconflicttest", +
"Alias": "insertconflicttest", +
"Conflict Resolution": "UPDATE", +
@@ -213,7 +214,8 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
{ +
"Node Type": "Result", +
"Parent Relationship": "Member", +
- "Parallel Aware": false +
+ "Parallel Aware": false, +
+ "Async Capable": false +
} +
] +
} +
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 81bdacf59d..b7818c0637 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -88,6 +88,7 @@ select count(*) = 1 as ok from pg_stat_wal;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
@@ -106,7 +107,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
On Mon, Feb 1, 2021 at 12:06 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Tue, Nov 17, 2020 at 6:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
* I haven't yet done anything about the issue on postgres_fdw's
handling of concurrent data fetches by multiple ForeignScan nodes
(below *different* Append nodes in the query) using the same
connection discussed in [2]. I modified the patch to just disable
applying this feature to problematic test cases in the postgres_fdw
regression tests, by a new GUC enable_async_append.

A solution for the issue would be a scheduler designed to handle such
data fetches more efficiently, but I don’t think it’s easy to create
such a scheduler. Rather than doing so, I'd like to propose to allow
FDWs to disable async execution of them in problematic cases by
themselves during executor startup in the first cut. What I have in
mind for that is:

1) For an FDW that has async-capable ForeignScan(s), we allow the FDW
to record, for each of the async-capable and non-async-capable
ForeignScan(s), the information on a connection to be used for the
ForeignScan into EState during BeginForeignScan().

2) After doing ExecProcNode() to each SubPlan and the main query tree
in InitPlan(), we give the FDW a chance to a) reconsider, for each of
the async-capable ForeignScan(s), whether the ForeignScan can be
executed asynchronously as planned, based on the information stored
into EState in #1, and then b) disable async execution of the
ForeignScan if not.
s/ExecProcNode()/ExecInitNode()/. Sorry for that. I’ll post an
updated patch for this in a few days.
Best regards,
Etsuro Fujita
On Thu, Feb 4, 2021 at 7:21 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Mon, Feb 1, 2021 at 12:06 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Rather than doing so, I'd like to propose to allow
FDWs to disable async execution of them in problematic cases by
themselves during executor startup in the first cut. What I have in
mind for that is:

1) For an FDW that has async-capable ForeignScan(s), we allow the FDW
to record, for each of the async-capable and non-async-capable
ForeignScan(s), the information on a connection to be used for the
ForeignScan into EState during BeginForeignScan().

2) After doing ExecProcNode() to each SubPlan and the main query tree
in InitPlan(), we give the FDW a chance to a) reconsider, for each of
the async-capable ForeignScan(s), whether the ForeignScan can be
executed asynchronously as planned, based on the information stored
into EState in #1, and then b) disable async execution of the
ForeignScan if not.

s/ExecProcNode()/ExecInitNode()/. Sorry for that. I’ll post an
updated patch for this in a few days.
I created a WIP patch for this. For #2, I added a new callback
routine ReconsiderAsyncForeignScan(). The routine for postgres_fdw
postgresReconsiderAsyncForeignScan() is pretty simple: async execution
of an async-capable ForeignScan is disabled if the connection used for
it is used in other parts of the query plan tree except async subplans
just below the parent Append. Here is a running example:
postgres=# create table t1 (a int, b int, c text);
postgres=# create table t2 (a int, b int, c text);
postgres=# create foreign table p1 (a int, b int, c text) server
server1 options (table_name 't1');
postgres=# create foreign table p2 (a int, b int, c text) server
server2 options (table_name 't2');
postgres=# create table pt (a int, b int, c text) partition by range (a);
postgres=# alter table pt attach partition p1 for values from (10) to (20);
postgres=# alter table pt attach partition p2 for values from (20) to (30);
postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM0000')
from generate_series(0, 99) i;
postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM0000')
from generate_series(0, 99) i;
postgres=# analyze pt;
postgres=# create table loct (a int, b int);
postgres=# create foreign table ft (a int, b int) server server1
options (table_name 'loct');
postgres=# insert into ft select i, i from generate_series(0, 99) i;
postgres=# analyze ft;
postgres=# create view v as select * from ft;
postgres=# explain verbose select * from pt, v where pt.b = v.b and v.b = 99;
QUERY PLAN
-----------------------------------------------------------------------------------------
Nested Loop (cost=200.00..306.84 rows=2 width=21)
Output: pt.a, pt.b, pt.c, ft.a, ft.b
-> Foreign Scan on public.ft (cost=100.00..102.27 rows=1 width=8)
Output: ft.a, ft.b
Remote SQL: SELECT a, b FROM public.loct WHERE ((b = 99))
-> Append (cost=100.00..204.55 rows=2 width=13)
-> Foreign Scan on public.p1 pt_1 (cost=100.00..102.27 rows=1 width=13)
Output: pt_1.a, pt_1.b, pt_1.c
Remote SQL: SELECT a, b, c FROM public.t1 WHERE ((b = 99))
-> Async Foreign Scan on public.p2 pt_2 (cost=100.00..102.27 rows=1 width=13)
Output: pt_2.a, pt_2.b, pt_2.c
Remote SQL: SELECT a, b, c FROM public.t2 WHERE ((b = 99))
(12 rows)
For this query, while p2 is executed asynchronously, p1 isn't, as it
uses the same connection as ft. However:
postgres=# create role view_owner SUPERUSER;
postgres=# create user mapping for view_owner server server1;
postgres=# alter view v owner to view_owner;
postgres=# explain verbose select * from pt, v where pt.b = v.b and v.b = 99;
QUERY PLAN
-----------------------------------------------------------------------------------------
Nested Loop (cost=200.00..306.84 rows=2 width=21)
Output: pt.a, pt.b, pt.c, ft.a, ft.b
-> Foreign Scan on public.ft (cost=100.00..102.27 rows=1 width=8)
Output: ft.a, ft.b
Remote SQL: SELECT a, b FROM public.loct WHERE ((b = 99))
-> Append (cost=100.00..204.55 rows=2 width=13)
-> Async Foreign Scan on public.p1 pt_1
(cost=100.00..102.27 rows=1 width=13)
Output: pt_1.a, pt_1.b, pt_1.c
Remote SQL: SELECT a, b, c FROM public.t1 WHERE ((b = 99))
-> Async Foreign Scan on public.p2 pt_2
(cost=100.00..102.27 rows=1 width=13)
Output: pt_2.a, pt_2.b, pt_2.c
Remote SQL: SELECT a, b, c FROM public.t2 WHERE ((b = 99))
(12 rows)
In this setup, p1 is executed asynchronously, as ft no longer uses the
same connection as p1.
I also added the following to postgresReconsiderAsyncForeignScan():
even if the connection isn't used in other parts of the plan tree,
async execution of an async-capable ForeignScan is disabled if the
subplans of the Append are all async-capable and they all use the same
connection, because in that case the subplans wouldn't be parallelized
at all, and the overhead of async execution could cause a performance
degradation.
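To illustrate that degenerate case with a hypothetical setup (the
table names q1/q2 below are illustrative, not part of the attached
regression tests): if every partition pointed at the same server, all
subplans of the Append would share one connection, and the new check
would keep them synchronous:

```sql
-- Hypothetical: both partitions live on server1, so all Append
-- subplans would go through a single postgres_fdw connection.
create foreign table q1 (a int, b int, c text) server server1
  options (table_name 't1');
create foreign table q2 (a int, b int, c text) server server1
  options (table_name 't2');
-- With the check in postgresReconsiderAsyncForeignScan(), EXPLAIN
-- should show plain Foreign Scans here rather than Async Foreign
-- Scans, since the fetches would be serialized on that one
-- connection anyway.
```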
Attached is an updated version of the patch. Sorry for the delay.
Best regards,
Etsuro Fujita
Attachments:
async-wip-2021-02-10.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ee0b4acf0b..3ecb8e1e4f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -62,6 +62,7 @@ typedef struct ConnCacheEntry
Oid serverid; /* foreign server OID used to get server name */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ PgFdwConnState state; /* extra per-connection state */
} ConnCacheEntry;
/*
@@ -117,7 +118,7 @@ static bool disconnect_cached_connections(Oid serverid);
* (not even on error), we need this flag to cue manual cleanup.
*/
PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
{
bool found;
bool retry = false;
@@ -264,6 +265,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Remember if caller will prepare statements */
entry->have_prep_stmt |= will_prep_stmt;
+ /* If caller needs access to the per-connection state, return it. */
+ if (state)
+ *state = &entry->state;
+
return entry->conn;
}
@@ -291,6 +296,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ memset(&entry->state, 0, sizeof(entry->state));
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 60c7e115d6..05428ee018 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -7021,7 +7021,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7049,7 +7049,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7077,7 +7077,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7105,7 +7105,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7133,7 +7133,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+----
(0 rows)
@@ -7175,23 +7175,28 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
@@ -7201,9 +7206,9 @@ select * from bar where f1 in (select f1 from foo) for update;
-> Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7213,23 +7218,28 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
@@ -7239,9 +7249,9 @@ select * from bar where f1 in (select f1 from foo) for share;
-> Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 368997d9d1..11b19ae1ef 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -37,9 +38,11 @@
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
#include "postgres_fdw.h"
+#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/float.h"
#include "utils/guc.h"
+#include "utils/hsearch.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -159,6 +162,11 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ /* for asynchronous execution */
+ Oid umid; /* Oid of user mapping */
+ PgFdwConnState *conn_state; /* extra per-connection state */
+ ForeignScanState *next_node; /* next ForeignScan node to activate */
+
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -408,6 +416,12 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresReconsiderAsyncForeignScan(ForeignScanState *node,
+ AsyncContext *acxt);
+static void postgresForeignAsyncRequest(AsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(AsyncRequest *areq);
+static void postgresForeignAsyncNotify(AsyncRequest *areq);
/*
* Helper functions
@@ -435,7 +449,11 @@ static void adjust_foreign_grouping_path_cost(PlannerInfo *root,
static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
+static UserMapping *get_user_mapping(EState *estate, ForeignScan *fsplan);
+static void record_foreign_scan_info(EState *estate, ForeignScanState *node,
+ UserMapping *user);
static void create_cursor(ForeignScanState *node);
+static void fetch_more_data_begin(ForeignScanState *node);
static void fetch_more_data(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
@@ -491,6 +509,7 @@ static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
double *totaldeadrows);
static void analyze_row_processor(PGresult *res, int row,
PgFdwAnalyzeState *astate);
+static void request_tuple_asynchronously(AsyncRequest *areq);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -583,6 +602,13 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for asynchronous execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ReconsiderAsyncForeignScan = postgresReconsiderAsyncForeignScan;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1417,19 +1443,40 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
{
ForeignScan *fsplan = (ForeignScan *) node->ss.ps.plan;
EState *estate = node->ss.ps.state;
+ bool asyncPlan = estate->es_plannedstmt->asyncPlan;
PgFdwScanState *fsstate;
- RangeTblEntry *rte;
- Oid userid;
- ForeignTable *table;
UserMapping *user;
- int rtindex;
int numParams;
/*
- * Do nothing in EXPLAIN (no ANALYZE) case. node->fdw_state stays NULL.
+ * No need to work hard in EXPLAIN (no ANALYZE) case. In that case,
+ * node->fdw_state stays NULL, or node->fdw_state->conn stays NULL.
*/
if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
+ {
+ /* Do nothing if the query plan tree has no async-aware Appends. */
+ if (!asyncPlan)
+ return;
+
+ /* Get info about user mapping. */
+ user = get_user_mapping(estate, fsplan);
+
+ /* Record the information on the ForeignScan node in the EState. */
+ record_foreign_scan_info(estate, node, user);
+
+ /*
+ * If the ForeignScan node is async-capable, save the user mapping
+ * OID in node->fdw_state for use later.
+ */
+ if (node->ss.ps.async_capable)
+ {
+ fsstate = (PgFdwScanState *) palloc0(sizeof(PgFdwScanState));
+ node->fdw_state = (void *) fsstate;
+ fsstate->umid = user->umid;
+ }
+
return;
+ }
/*
* We'll save private state in node->fdw_state.
@@ -1437,28 +1484,27 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
fsstate = (PgFdwScanState *) palloc0(sizeof(PgFdwScanState));
node->fdw_state = (void *) fsstate;
- /*
- * Identify which user to do the remote access as. This should match what
- * ExecCheckRTEPerms() does. In case of a join or aggregate, use the
- * lowest-numbered member RTE as a representative; we would get the same
- * result from any.
- */
- if (fsplan->scan.scanrelid > 0)
- rtindex = fsplan->scan.scanrelid;
- else
- rtindex = bms_next_member(fsplan->fs_relids, -1);
- rte = exec_rt_fetch(rtindex, estate);
- userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
+ /* Get info about user mapping. */
+ user = get_user_mapping(estate, fsplan);
- /* Get info about foreign table. */
- table = GetForeignTable(rte->relid);
- user = GetUserMapping(userid, table->serverid);
+ if (asyncPlan)
+ {
+ /* Record the information on the ForeignScan node in the EState. */
+ record_foreign_scan_info(estate, node, user);
+
+ /*
+ * If the ForeignScan node is async-capable, save the user mapping
+ * OID in node->fdw_state for use later.
+ */
+ if (node->ss.ps.async_capable)
+ fsstate->umid = user->umid;
+ }
/*
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
/* Assign a unique ID for my cursor */
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1509,6 +1555,11 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
+
+ /* Initialize async state */
+ fsstate->conn_state->activated = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
}
/*
@@ -1523,8 +1574,10 @@ postgresIterateForeignScan(ForeignScanState *node)
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
+ * In sync mode, if this is the first call after Begin or ReScan, we need
+ * to create the cursor on the remote side. In async mode, we would have
+ * already created the cursor before we get here, even if this is the first
+ * call after Begin or ReScan.
*/
if (!fsstate->cursor_exists)
create_cursor(node);
@@ -1534,6 +1587,9 @@ postgresIterateForeignScan(ForeignScanState *node)
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
+ /* In async mode, just clear tuple slot. */
+ if (node->ss.ps.async_capable)
+ return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
fetch_more_data(node);
@@ -1563,6 +1619,14 @@ postgresReScanForeignScan(ForeignScanState *node)
char sql[64];
PGresult *res;
+ /* Reset async state */
+ if (node->ss.ps.async_capable)
+ {
+ fsstate->conn_state->activated = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
+ }
+
/* If we haven't created the cursor yet, nothing to do. */
if (!fsstate->cursor_exists)
return;
@@ -1617,10 +1681,21 @@ postgresEndForeignScan(ForeignScanState *node)
{
PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
- /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
- if (fsstate == NULL)
+ /*
+ * if fsstate is NULL or if fsstate->conn is NULL, we are in EXPLAIN;
+ * nothing to do
+ */
+ if (fsstate == NULL || fsstate->conn == NULL)
return;
+ /*
+ * If we're ending before we've collected a response from an asynchronous
+ * query, we have to consume the response.
+ */
+ if (fsstate->conn_state->activated == node &&
+ fsstate->conn_state->async_query_sent)
+ fetch_more_data(node);
+
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
close_cursor(fsstate->conn, fsstate->cursor_number);
@@ -2491,7 +2566,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->conn = GetConnection(user, false, NULL);
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2872,7 +2947,7 @@ estimate_path_cost_size(PlannerInfo *root,
false, &retrieved_attrs, NULL);
/* Get the remote estimate */
- conn = GetConnection(fpinfo->user, false);
+ conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3428,6 +3503,53 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
return true;
}
+static UserMapping *
+get_user_mapping(EState *estate, ForeignScan *fsplan)
+{
+ int rtindex;
+ RangeTblEntry *rte;
+ Oid userid;
+ ForeignTable *table;
+
+ /*
+ * Identify which user to do the remote access as. This should match what
+ * ExecCheckRTEPerms() does. In case of a join or aggregate, use the
+ * lowest-numbered member RTE as a representative; we would get the same
+ * result from any.
+ */
+ if (fsplan->scan.scanrelid > 0)
+ rtindex = fsplan->scan.scanrelid;
+ else
+ rtindex = bms_next_member(fsplan->fs_relids, -1);
+ rte = exec_rt_fetch(rtindex, estate);
+ userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
+
+ /* Get info about foreign table. */
+ table = GetForeignTable(rte->relid);
+
+ return GetUserMapping(userid, table->serverid);
+}
+
+static void
+record_foreign_scan_info(EState *estate, ForeignScanState *node,
+ UserMapping *user)
+{
+ HTAB *htab = estate->es_foreign_scan_hash;
+ int fsplanid = node->ss.ps.plan->plan_node_id;
+ bool found;
+ ForeignScanHashEntry *entry;
+
+ /* Find or create hash table entry for the user mapping. */
+ Assert(htab);
+ entry = (ForeignScanHashEntry *) hash_search(htab, &user->umid,
+ HASH_ENTER, &found);
+
+ if (!found)
+ entry->fsplanids = bms_make_singleton(fsplanid);
+ else
+ entry->fsplanids = bms_add_member(entry->fsplanids, fsplanid);
+}
+
/*
* Create cursor for node's query with current parameter values.
*/
@@ -3500,6 +3622,34 @@ create_cursor(ForeignScanState *node)
pfree(buf.data);
}
+/*
+ * Begin an asynchronous data fetch.
+ * fetch_more_data must be called to fetch the results.
+ */
+static void
+fetch_more_data_begin(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PGconn *conn = fsstate->conn;
+ char sql[64];
+
+ Assert(fsstate->conn_state->activated == node);
+ Assert(!fsstate->conn_state->async_query_sent);
+
+ /* Create the cursor synchronously. */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ /* We will send this query, but not wait for the response. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, fsstate->query);
+
+ fsstate->conn_state->async_query_sent = true;
+}
+
/*
* Fetch some more rows from the node's cursor.
*/
@@ -3522,17 +3672,36 @@ fetch_more_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->conn;
- char sql[64];
int numrows;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
+ if (node->ss.ps.async_capable)
+ {
+ Assert(fsstate->conn_state->activated == node);
+ Assert(fsstate->conn_state->async_query_sent);
+
+ /*
+ * The query was already sent by an earlier call to
+ * fetch_more_data_begin. So now we just fetch the result.
+ */
+ res = PQgetResult(conn);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
+ else
+ {
+ char sql[64];
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
- pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ /* This is a regular synchronous fetch. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ res = pgfdw_exec_query(conn, sql);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
/* Convert the data into HeapTuples */
numrows = PQntuples(res);
@@ -3559,6 +3728,15 @@ fetch_more_data(ForeignScanState *node)
/* Must be EOF if we didn't get as many tuples as we asked for. */
fsstate->eof_reached = (numrows < fsstate->fetch_size);
+
+ /* If this was the second part of an async request, we must fetch until NULL. */
+ if (node->ss.ps.async_capable)
+ {
+ /* call once and raise error if not NULL as expected? */
+ while (PQgetResult(conn) != NULL)
+ ;
+ fsstate->conn_state->async_query_sent = false;
+ }
}
PG_FINALLY();
{
@@ -3684,7 +3862,7 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->conn = GetConnection(user, true, NULL);
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -4618,7 +4796,7 @@ postgresAnalyzeForeignTable(Relation relation,
*/
table = GetForeignTable(RelationGetRelid(relation));
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct command to get page count for relation.
@@ -4704,7 +4882,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
table = GetForeignTable(RelationGetRelid(relation));
server = GetForeignServer(table->serverid);
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct cursor that retrieves whole rows from remote.
@@ -4932,7 +5110,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
*/
server = GetForeignServer(serverOid);
mapping = GetUserMapping(GetUserId(), server->serverid);
- conn = GetConnection(mapping, false);
+ conn = GetConnection(mapping, false, NULL);
/* Don't attempt to import collation if remote server hasn't got it */
if (PQserverVersion(conn) < 90100)
@@ -6479,6 +6657,221 @@ add_foreign_final_paths(PlannerInfo *root, RelOptInfo *input_rel,
add_path(final_rel, (Path *) final_path);
}
+/*
+ * postgresIsForeignPathAsyncCapable
+ * Check whether a given ForeignPath node is async-capable.
+ */
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * postgresReconsiderAsyncForeignScan
+ * Re-examine a given ForeignScan node that was planned as async-capable.
+ */
+static bool
+postgresReconsiderAsyncForeignScan(ForeignScanState *node, AsyncContext *acxt)
+{
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ EState *estate = node->ss.ps.state;
+ HTAB *htab = estate->es_foreign_scan_hash;
+ bool found;
+ ForeignScanHashEntry *entry;
+ AppendState *requestor = (AppendState *) acxt->requestor;
+ Bitmapset *asyncplanids = requestor->as_asyncplanids;
+ Bitmapset *fsplanids;
+
+ /* Find hash table entry for the ForeignScan node. */
+ Assert(htab);
+ entry = (ForeignScanHashEntry *) hash_search(htab, &fsstate->umid,
+ HASH_FIND, &found);
+ Assert(found);
+
+ fsplanids = entry->fsplanids;
+ Assert(bms_is_member(node->ss.ps.plan->plan_node_id, fsplanids));
+
+ /*
+ * If the connection used for the ForeignScan node is used in other parts
+ * of the query plan tree except async subplans of the parent Append node,
+ * disable async execution of the ForeignScan node.
+ */
+ if (!bms_is_subset(fsplanids, asyncplanids))
+ return false;
+
+ /*
+ * If the subplans of the Append node are all async-capable, and use the
+ * same connection, then we won't execute them asynchronously.
+ */
+ if (requestor->as_nasyncplans == requestor->as_nplans &&
+ !bms_nonempty_difference(asyncplanids, fsplanids))
+ return false;
+
+ return true;
+}
+
+/*
+ * postgresForeignAsyncRequest
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+
+ /*
+ * If this is the first call after Begin or ReScan, mark the connection
+ * as used by the ForeignScan node.
+ */
+ if (fsstate->conn_state->activated == NULL)
+ fsstate->conn_state->activated = node;
+
+ /*
+ * If the connection has already been used by a ForeignScan node, put it
+ * at the end of the chain of waiting ForeignScan nodes, and then return.
+ */
+ if (node != fsstate->conn_state->activated)
+ {
+ ForeignScanState *curr_node = fsstate->conn_state->activated;
+ PgFdwScanState *curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+
+ /* Scan down the chain ... */
+ while (curr_fsstate->next_node)
+ {
+ curr_node = curr_fsstate->next_node;
+ Assert(node != curr_node);
+ curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+ }
+ /* Update the chain linking */
+ curr_fsstate->next_node = node;
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+
+ request_tuple_asynchronously(areq);
+}
+
+/*
+ * postgresForeignAsyncConfigureWait
+ * Configure a file descriptor event for which we wish to wait.
+ */
+static void
+postgresForeignAsyncConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AppendState *requestor = (AppendState *) areq->requestor;
+ WaitEventSet *set = requestor->as_eventset;
+
+ /* This function should not be called unless callback_pending */
+ Assert(areq->callback_pending);
+
+ /* If the ForeignScan node isn't activated yet, nothing to do */
+ if (fsstate->conn_state->activated != node)
+ return;
+
+ AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
+ NULL, areq);
+}
+
+/*
+ * postgresForeignAsyncNotify
+ * Fetch some more tuples from a file descriptor that becomes ready,
+ * requesting next tuple.
+ */
+static void
+postgresForeignAsyncNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+
+ /* The core code would have initialized the callback_pending flag */
+ Assert(!areq->callback_pending);
+
+ fetch_more_data(node);
+
+ request_tuple_asynchronously(areq);
+}
+
+/*
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+request_tuple_asynchronously(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ TupleTableSlot *result;
+
+ /* Request some more tuples, if we've run out */
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ /* No point in another fetch if we already detected EOF, though */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+ fsstate->conn_state->activated = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->activated = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin an asynchronous fetch for that node */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ return;
+ }
+
+ /* Get a tuple from the ForeignScan node */
+ result = ExecProcNode((PlanState *) node);
+
+ if (TupIsNull(result))
+ {
+ Assert(fsstate->next_tuple >= fsstate->num_tuples);
+
+ /* Request some more tuples, if we've not detected EOF yet */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+ fsstate->conn_state->activated = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->activated = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin an asynchronous fetch for that node */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+ }
+
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 1f67b4d9fd..c3537b6449 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,6 +16,7 @@
#include "foreign/foreign.h"
#include "lib/stringinfo.h"
#include "libpq-fe.h"
+#include "nodes/execnodes.h"
#include "nodes/pathnodes.h"
#include "utils/relcache.h"
@@ -124,12 +125,22 @@ typedef struct PgFdwRelationInfo
int relation_index;
} PgFdwRelationInfo;
+/*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+ ForeignScanState *activated; /* currently-activated ForeignScan node */
+ bool async_query_sent; /* has an asynchronous query been sent? */
+} PgFdwConnState;
+
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+ PgFdwConnState **state);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 151f4f1834..ceda16b92f 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1822,31 +1822,31 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1882,12 +1882,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5ef1c7ad3c..4a9eece710 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4735,6 +4735,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<variablelist>
+ <varlistentry id="guc-enable-async-append" xreflabel="enable_async_append">
+ <term><varname>enable_async_append</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_async_append</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of async-aware
+ append plan types. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
<term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c602ee4427..a2d2f42e28 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1563,6 +1563,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AppendReady</literal></entry>
+ <entry>Waiting for a subplan of Append to be ready.</entry>
+ </row>
<row>
<entry><literal>BackupWaitWalArchive</literal></entry>
<entry>Waiting for WAL files required for a backup to be successfully
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f80e379973..a2b7b8bd67 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1390,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (planstate->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1409,6 +1411,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (custom_name)
ExplainPropertyText("Custom Plan Provider", custom_name, es);
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
+ ExplainPropertyBool("Async Capable", planstate->async_capable, es);
}
switch (nodeTag(plan))
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 23bdb53cd1..613835b748 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -526,6 +526,10 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c74ce36ffb..caf11dc4e0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -48,6 +48,7 @@
#include "commands/matview.h"
#include "commands/trigger.h"
#include "executor/execdebug.h"
+#include "executor/nodeAppend.h"
#include "executor/nodeSubplan.h"
#include "foreign/fdwapi.h"
#include "jit/jit.h"
@@ -78,6 +79,8 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecBuildForeignScanHashTable(EState *estate);
+static void ExecReconsiderPlan(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(EState *estate, PlanState *planstate,
@@ -886,6 +889,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
/* signal that this EState is not used for EPQ */
estate->es_epq_active = NULL;
+ if (plannedstmt->asyncPlan)
+ ExecBuildForeignScanHashTable(estate);
+
/*
* Initialize private state information for each SubPlan. We must do this
* before running ExecInitNode on the main query tree, since
@@ -924,6 +930,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
*/
planstate = ExecInitNode(plan, estate, eflags);
+ if (plannedstmt->asyncPlan)
+ ExecReconsiderPlan(estate);
+
/*
* Get the tuple descriptor describing the type of tuples to return.
*/
@@ -1321,6 +1330,35 @@ ExecGetTriggerResultRel(EState *estate, Oid relid)
return rInfo;
}
+static void
+ExecBuildForeignScanHashTable(EState *estate)
+{
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(ForeignScanHashEntry);
+ ctl.hcxt = CurrentMemoryContext;
+
+ estate->es_foreign_scan_hash =
+ hash_create("User mapping dependency table", 256,
+ &ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+}
+
+static void
+ExecReconsiderPlan(EState *estate)
+{
+ ListCell *lc;
+
+ foreach(lc, estate->es_asyncappends)
+ {
+ AppendState *appendstate = (AppendState *) lfirst(lc);
+
+ ExecReconsiderAsyncAppend(appendstate);
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index c734283bfe..df7b9b591b 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -156,6 +156,9 @@ CreateExecutorState(void)
estate->es_use_parallel_mode = false;
+ estate->es_asyncappends = NIL;
+ estate->es_foreign_scan_hash = NULL;
+
estate->es_jit_flags = 0;
estate->es_jit = NULL;
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 15e4115bd6..3896d9fcd4 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -57,10 +57,13 @@
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
/* Shared state for parallel-aware Append. */
struct ParallelAppendState
@@ -78,12 +81,18 @@ struct ParallelAppendState
};
#define INVALID_SUBPLAN_INDEX -1
+#define EVENT_BUFFER_SIZE 16
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
static void mark_invalid_subplans_as_finished(AppendState *node);
+static void ExecAppendAsyncBegin(AppendState *node);
+static bool ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result);
+static void ExecAppendAsyncEventWait(AppendState *node);
+static void classify_matching_subplans(AppendState *node);
/* ----------------------------------------------------------------
* ExecInitAppend
@@ -102,7 +111,10 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
AppendState *appendstate = makeNode(AppendState);
PlanState **appendplanstates;
Bitmapset *validsubplans;
+ Bitmapset *asyncplans;
+ Bitmapset *asyncplanids;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
@@ -119,6 +131,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_syncdone = false;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -191,12 +204,27 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* While at it, find out the first valid partial plan.
*/
j = 0;
+ asyncplans = NULL;
+ asyncplanids = NULL;
+ nasyncplans = 0;
firstvalid = nplans;
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record async subplans. When executing EvalPlanQual, we process
+ * async subplans synchronously, so don't do this in that case.
+ */
+ if (initNode->async_capable && estate->es_epq_active == NULL)
+ {
+ asyncplans = bms_add_member(asyncplans, j);
+ asyncplanids = bms_add_member(asyncplanids,
+ initNode->plan_node_id);
+ nasyncplans++;
+ }
+
/*
* Record the lowest appendplans index which is a valid partial plan.
*/
@@ -210,6 +238,11 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* Initialize async state */
+ appendstate->as_asyncplans = asyncplans;
+ appendstate->as_asyncplanids = asyncplanids;
+ appendstate->as_nasyncplans = nasyncplans;
+
/*
* Miscellaneous initialization
*/
@@ -219,6 +252,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* For parallel query, this will be overridden later. */
appendstate->choose_next_subplan = choose_next_subplan_locally;
+ /*
+ * Lastly, if there is at least one async subplan, add the Append node to
+ * estate->es_asyncappends so that we can re-examine it in
+ * ExecReconsiderPlan.
+ */
+ if (nasyncplans > 0)
+ estate->es_asyncappends = lappend(estate->es_asyncappends,
+ appendstate);
+
return appendstate;
}
@@ -232,31 +274,45 @@ static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
+ TupleTableSlot *result;
- if (node->as_whichplan < 0)
+ if (!node->as_syncdone && node->as_whichplan == INVALID_SUBPLAN_INDEX)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ /* If there are any async subplans, begin execution of them */
+ if (node->as_nasyncplans > 0)
+ ExecAppendAsyncBegin(node);
+
/*
- * If no subplan has been chosen, we must choose one before
+ * If no sync subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
- !node->choose_next_subplan(node))
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
{
PlanState *subnode;
- TupleTableSlot *result;
CHECK_FOR_INTERRUPTS();
/*
- * figure out which subplan we are currently processing
+ * try to get a tuple from any of the async subplans
+ */
+ if (!bms_is_empty(node->as_needrequest) ||
+ (node->as_syncdone && node->as_nasyncremain > 0))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
+ Assert(bms_is_empty(node->as_needrequest));
+ }
+
+ /*
+ * figure out which sync subplan we are currently processing
*/
Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
subnode = node->appendplans[node->as_whichplan];
@@ -276,8 +332,16 @@ ExecAppend(PlanState *pstate)
return result;
}
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /* wait or poll async events */
+ if (node->as_nasyncremain > 0)
+ {
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ ExecAppendAsyncEventWait(node);
+ }
+
+ /* choose new sync subplan; if no sync/async subplans, we're done */
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -313,6 +377,7 @@ ExecEndAppend(AppendState *node)
void
ExecReScanAppend(AppendState *node)
{
+ int nasyncplans = node->as_nasyncplans;
int i;
/*
@@ -326,6 +391,11 @@ ExecReScanAppend(AppendState *node)
{
bms_free(node->as_valid_subplans);
node->as_valid_subplans = NULL;
+ if (nasyncplans > 0)
+ {
+ bms_free(node->as_valid_asyncplans);
+ node->as_valid_asyncplans = NULL;
+ }
}
for (i = 0; i < node->as_nplans; i++)
@@ -347,8 +417,26 @@ ExecReScanAppend(AppendState *node)
ExecReScan(subnode);
}
+ /* Reset async state */
+ if (nasyncplans > 0)
+ {
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+ }
+
+ bms_free(node->as_needrequest);
+ node->as_needrequest = NULL;
+ }
+
/* Let choose_next_subplan_* function handle setting the first subplan */
node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_syncdone = false;
}
/* ----------------------------------------------------------------
@@ -429,7 +517,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
/* ----------------------------------------------------------------
* choose_next_subplan_locally
*
- * Choose next subplan for a non-parallel-aware Append,
+ * Choose next sync subplan for a non-parallel-aware Append,
* returning false if there are no more.
* ----------------------------------------------------------------
*/
@@ -444,9 +532,9 @@ choose_next_subplan_locally(AppendState *node)
/*
* If first call then have the bms member function choose the first valid
- * subplan by initializing whichplan to -1. If there happen to be no
- * valid subplans then the bms member function will handle that by
- * returning a negative number which will allow us to exit returning a
+ * sync subplan by initializing whichplan to -1. If there happen to be
+ * no valid sync subplans then the bms member function will handle that
+ * by returning a negative number which will allow us to exit returning a
* false value.
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
@@ -467,7 +555,10 @@ choose_next_subplan_locally(AppendState *node)
nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
if (nextplan < 0)
+ {
+ node->as_syncdone = true;
return false;
+ }
node->as_whichplan = nextplan;
@@ -709,3 +800,362 @@ mark_invalid_subplans_as_finished(AppendState *node)
node->as_pstate->pa_finished[i] = true;
}
}
+
+/* ----------------------------------------------------------------
+ * ExecReconsiderAsyncAppend
+ *
+ * Re-examine an async-aware Append node
+ * ----------------------------------------------------------------
+ */
+void
+ExecReconsiderAsyncAppend(AppendState *node)
+{
+ Bitmapset *asyncplans = bms_copy(node->as_asyncplans);
+ int nasyncplans;
+ AsyncRequest **asyncrequests;
+ AsyncContext acxt;
+ int i;
+
+ asyncrequests = (AsyncRequest **) palloc0(node->as_nplans *
+ sizeof(AsyncRequest *));
+
+ /* Re-examine each async subplan */
+ acxt.requestor = (PlanState *) node;
+ i = -1;
+ while ((i = bms_next_member(asyncplans, i)) >= 0)
+ {
+ PlanState *subnode = node->appendplans[i];
+
+ acxt.request_index = i;
+ if (!ExecReconsiderAsyncCapablePlan(subnode, &acxt))
+ {
+ bms_del_member(node->as_asyncplans, i);
+ bms_del_member(node->as_asyncplanids,
+ subnode->plan->plan_node_id);
+ --node->as_nasyncplans;
+ }
+ else
+ {
+ AsyncRequest *areq;
+
+ areq = palloc(sizeof(AsyncRequest));
+ areq->requestor = (PlanState *) node;
+ areq->requestee = subnode;
+ areq->request_index = i;
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+
+ asyncrequests[i] = areq;
+ }
+ }
+ bms_free(asyncplans);
+
+ /* No need for further processing if there are no async subplans */
+ nasyncplans = node->as_nasyncplans;
+ if (nasyncplans == 0)
+ return;
+
+ /* Initialize remaining async state */
+ node->as_asyncrequests = asyncrequests;
+ node->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ node->as_needrequest = NULL;
+
+ classify_matching_subplans(node);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncBegin
+ *
+ * Begin execution of designated async-capable subplans.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncBegin(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+ int i;
+
+ /* We should never be called when there are no async subplans. */
+ Assert(node->as_nasyncplans > 0);
+
+ if (node->as_valid_subplans == NULL)
+ {
+ Assert(node->as_valid_asyncplans == NULL);
+
+ node->as_valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ classify_matching_subplans(node);
+ }
+
+ node->as_nasyncremain = 0;
+
+ /* Nothing to do if there are no valid async subplans. */
+ valid_asyncplans = node->as_valid_asyncplans;
+ if (valid_asyncplans == NULL)
+ return;
+
+ /* Make a request for each of the async subplans. */
+ i = -1;
+ while ((i = bms_next_member(valid_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(areq->request_index == i);
+ Assert(!areq->callback_pending);
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+
+ ++node->as_nasyncremain;
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncGetNext
+ *
+ * Get the next tuple from any of the asynchronous subplans.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
+{
+ *result = NULL;
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ while (node->as_nasyncremain > 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Wait or poll async events. */
+ ExecAppendAsyncEventWait(node);
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ /* Break from the loop if there is any sync subplan that is not complete */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If all sync subplans are complete, we're totally done scanning the
+ * given node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the sync subplans.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncremain == 0);
+ *result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncRequest
+ *
+ * If there are any asynchronous subplans that need a new asynchronous
+ * request, make all of them.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
+{
+ Bitmapset *needrequest;
+ int i;
+
+ /* Nothing to do if there are no async subplans needing a new request. */
+ if (bms_is_empty(node->as_needrequest))
+ return false;
+
+ /*
+ * If there are any asynchronously-generated results that have not yet
+ * been returned, we have nothing to do; just return one of them.
+ */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ /* Make a new request for each of the async subplans that need it. */
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ i = -1;
+ while ((i = bms_next_member(needrequest, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(needrequest);
+
+ /* Return one of the asynchronously-generated results if any. */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncEventWait
+ *
+ * Wait or poll for file descriptor wait events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncEventWait(AppendState *node)
+{
+ long timeout = node->as_syncdone ? -1 : 0;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+
+ /* Nothing to do if there are no remaining async subplans. */
+ if (node->as_nasyncremain == 0)
+ return;
+
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
+ node->as_nasyncplans + 1);
+ AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+ NULL, NULL);
+
+ /* Give each waiting subplan a chance to add an event. */
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ if (areq->callback_pending)
+ ExecAsyncConfigureWait(areq);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+ EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ FreeWaitEventSet(node->as_eventset);
+ node->as_eventset = NULL;
+ if (noccurred == 0)
+ return;
+
+ /* Deliver notifications. */
+ for (i = 0; i < noccurred; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ /*
+ * Each waiting subplan should have registered its wait event with
+ * user_data pointing back to its AsyncRequest.
+ */
+ if ((w->events & WL_SOCKET_READABLE) != 0)
+ {
+ AsyncRequest *areq = (AsyncRequest *) w->user_data;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ Assert(areq->callback_pending);
+ areq->callback_pending = false;
+
+ /* Do the actual work. */
+ ExecAsyncNotify(areq);
+ }
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(AsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot = areq->result;
+
+ /* The result should be a TupleTableSlot or NULL. */
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Nothing to do if the request is pending. */
+ if (!areq->request_complete)
+ {
+ /*
+ * The subplan for which the request was made would be pending for a
+ * callback.
+ */
+ Assert(areq->callback_pending);
+ return;
+ }
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ {
+ /* The ending subplan wouldn't have been pending for a callback. */
+ Assert(!areq->callback_pending);
+ --node->as_nasyncremain;
+ return;
+ }
+
+ /* Save result so we can return it */
+ Assert(node->as_nasyncresults < node->as_nasyncplans);
+ node->as_asyncresults[node->as_nasyncresults++] = slot;
+
+ /*
+ * Mark the subplan that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
+
+/* ----------------------------------------------------------------
+ * classify_matching_subplans
+ *
+ * Classify the node's as_valid_subplans into sync ones and
+ * async ones, adjust it to contain sync ones only, and save
+ * async ones in the node's as_valid_asyncplans
+ * ----------------------------------------------------------------
+ */
+static void
+classify_matching_subplans(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+
+ /* Nothing to do if there are no valid subplans. */
+ if (bms_is_empty(node->as_valid_subplans))
+ return;
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (!bms_overlap(node->as_valid_subplans, node->as_asyncplans))
+ return;
+
+ /* Get valid async subplans. */
+ valid_asyncplans = bms_copy(node->as_asyncplans);
+ valid_asyncplans = bms_int_members(valid_asyncplans,
+ node->as_valid_subplans);
+
+ /* Adjust the valid subplans to contain sync subplans only. */
+ node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
+ valid_asyncplans);
+
+ /* Save valid async subplans. */
+ node->as_valid_asyncplans = valid_asyncplans;
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0969e53c3a..c92a35b8a6 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -22,6 +22,7 @@
*/
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/executor.h"
#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
@@ -222,6 +223,9 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
if (node->resultRelation > 0)
scanstate->resultRelInfo = estate->es_result_relations[node->resultRelation - 1];
+ /* Initialize the async_capable flag. */
+ scanstate->ss.ps.async_capable = ((Plan *) node)->async_capable;
+
/* Initialize any outer plan. */
if (outerPlan(node))
outerPlanState(scanstate) =
@@ -391,3 +395,73 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecReconsiderAsyncForeignScan
+ *
+ * Re-examine a ForeignScan node that was considered async-capable
+ * at plan time.
+ * ----------------------------------------------------------------
+ */
+bool
+ExecReconsiderAsyncForeignScan(ForeignScanState *node, AsyncContext *acxt)
+{
+ FdwRoutine *fdwroutine = node->fdwroutine;
+ bool result = true;
+
+ if (fdwroutine->ReconsiderAsyncForeignScan)
+ {
+ result = fdwroutine->ReconsiderAsyncForeignScan(node, acxt);
+ if (!result)
+ node->ss.ps.async_capable = false;
+ }
+ return result;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Asynchronously request a tuple from a designated async-capable node
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Callback invoked when a relevant event has occurred
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 65bbc18ecb..8d64772931 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -87,6 +87,7 @@ _copyPlannedStmt(const PlannedStmt *from)
COPY_SCALAR_FIELD(transientPlan);
COPY_SCALAR_FIELD(dependsOnRole);
COPY_SCALAR_FIELD(parallelModeNeeded);
+ COPY_SCALAR_FIELD(asyncPlan);
COPY_SCALAR_FIELD(jitFlags);
COPY_NODE_FIELD(planTree);
COPY_NODE_FIELD(rtable);
@@ -120,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -241,6 +243,7 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f5dcedf6e8..80a853d706 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -305,6 +305,7 @@ _outPlannedStmt(StringInfo str, const PlannedStmt *node)
WRITE_BOOL_FIELD(transientPlan);
WRITE_BOOL_FIELD(dependsOnRole);
WRITE_BOOL_FIELD(parallelModeNeeded);
+ WRITE_BOOL_FIELD(asyncPlan);
WRITE_INT_FIELD(jitFlags);
WRITE_NODE_FIELD(planTree);
WRITE_NODE_FIELD(rtable);
@@ -333,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -431,6 +433,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
@@ -2221,6 +2224,7 @@ _outPlannerGlobal(StringInfo str, const PlannerGlobal *node)
WRITE_BOOL_FIELD(parallelModeOK);
WRITE_BOOL_FIELD(parallelModeNeeded);
WRITE_CHAR_FIELD(maxParallelHazard);
+ WRITE_BOOL_FIELD(asyncPlan);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 4388aae71d..5104d1a2b4 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1581,6 +1581,7 @@ _readPlannedStmt(void)
READ_BOOL_FIELD(transientPlan);
READ_BOOL_FIELD(dependsOnRole);
READ_BOOL_FIELD(parallelModeNeeded);
+ READ_BOOL_FIELD(asyncPlan);
READ_INT_FIELD(jitFlags);
READ_NODE_FIELD(planTree);
READ_NODE_FIELD(rtable);
@@ -1614,6 +1615,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1710,6 +1712,7 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index aab06c7d21..3b034a0326 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -147,6 +147,7 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
+bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 6c8305c977..b30f8255f2 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,6 +81,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
+static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1066,6 +1067,30 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
+
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1083,6 +1108,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1090,6 +1116,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1153,6 +1180,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
+ /* If appropriate, consider async append */
+ consider_async = (enable_async_append && pathkeys == NIL &&
+ !best_path->path.parallel_safe &&
+ list_length(best_path->subpaths) > 1);
+
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1220,6 +1252,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (consider_async && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ ++nasyncplans;
+ }
}
/*
@@ -1252,9 +1291,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
+ plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
+ if (nasyncplans > 0)
+ root->glob->asyncPlan = true;
+
copy_generic_path_info(&plan->plan, (Path *) best_path);
/*
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index adf68d8790..95e7601a31 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -312,6 +312,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
glob->lastPlanNodeId = 0;
glob->transientPlan = false;
glob->dependsOnRole = false;
+ glob->asyncPlan = false;
/*
* Assess whether it's feasible to use parallel mode for this query. We
@@ -513,6 +514,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
result->transientPlan = glob->transientPlan;
result->dependsOnRole = glob->dependsOnRole;
result->parallelModeNeeded = glob->parallelModeNeeded;
+ result->asyncPlan = glob->asyncPlan;
result->planTree = top_plan;
result->rtable = glob->finalrtable;
result->resultRelations = glob->resultRelations;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..58f8e0bbcf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3999,6 +3999,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
switch (w)
{
+ case WAIT_EVENT_APPEND_READY:
+ event_name = "AppendReady";
+ break;
case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
event_name = "BackupWaitWalArchive";
break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index eafdb1118e..507567aff3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1111,6 +1111,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of async append plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_async_append,
+ true,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index bd57e917e1..1306094865 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -370,6 +370,7 @@
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on
+#enable_async_append = on
# - Planner Cost Constants -
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index cafd410a5d..8c7ebc2998 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -25,4 +25,7 @@ extern void ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt);
+extern void ExecReconsiderAsyncAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(AsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 6ae7733e25..56c3809d2d 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -17,6 +17,8 @@
#include "access/parallel.h"
#include "nodes/execnodes.h"
+struct AsyncContext;
+
extern ForeignScanState *ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags);
extern void ExecEndForeignScan(ForeignScanState *node);
extern void ExecReScanForeignScan(ForeignScanState *node);
@@ -31,4 +33,10 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecReconsiderAsyncForeignScan(ForeignScanState *node,
+ struct AsyncContext *acxt);
+extern void ExecAsyncForeignScanRequest(AsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncForeignScanNotify(AsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 248f78da45..99cabd6b94 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -19,6 +19,7 @@
/* To avoid including explain.h here, reference ExplainState thus: */
struct ExplainState;
+struct AsyncContext;
/*
* Callback function signatures --- see fdwhandler.sgml for more info.
@@ -178,6 +179,17 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+
+typedef bool (*ReconsiderAsyncForeignScan_function) (ForeignScanState *node,
+ struct AsyncContext *acxt);
+
+typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncConfigureWait_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncNotify_function) (AsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -256,6 +268,13 @@ typedef struct FdwRoutine
/* Support functions for path reparameterization. */
ReparameterizeForeignPathByChild_function ReparameterizeForeignPathByChild;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ReconsiderAsyncForeignScan_function ReconsiderAsyncForeignScan;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b6a88ff76b..68584b3c14 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -512,6 +512,32 @@ typedef struct ResultRelInfo
struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
+/* ----------------
+ * AsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct AsyncRequest
+{
+ struct PlanState *requestor; /* Node that wants a tuple */
+ struct PlanState *requestee; /* Node from which a tuple is wanted */
+ int request_index; /* Scratch space for requestor */
+ bool callback_pending; /* Callback is needed */
+ bool request_complete; /* Request complete, result valid */
+ TupleTableSlot *result; /* Result (NULL if no more tuples) */
+} AsyncRequest;
+
+/*
+ * Hash entry to store the set of IDs of ForeignScanStates that use the same
+ * user mapping
+ */
+typedef struct ForeignScanHashEntry
+{
+ Oid umid; /* hash key -- must be first */
+ Bitmapset *fsplanids;
+} ForeignScanHashEntry;
+
/* ----------------
* EState information
*
@@ -602,6 +628,14 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+ List *es_asyncappends; /* List of async-aware AppendStates */
+
+ /*
+ * Hash table to store the set of IDs of ForeignScanStates using the same
+ * user mapping
+ */
+ HTAB *es_foreign_scan_hash;
+
/*
* JIT information. es_jit_flags indicates whether JIT should be performed
* and with which options. es_jit is created on-demand when JITing is
@@ -969,6 +1003,8 @@ typedef struct PlanState
*/
Bitmapset *chgParam; /* set of IDs of changed Params */
+ bool async_capable;
+
/*
* Other run-time state needed by most if not all node types.
*/
@@ -1217,12 +1253,24 @@ struct AppendState
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
int as_whichplan;
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_asyncplans; /* asynchronous plans indexes */
+ Bitmapset *as_asyncplanids; /* asynchronous plans IDs */
+ int as_nasyncplans; /* # of asynchronous plans */
+ AsyncRequest **as_asyncrequests; /* array of AsyncRequests */
+ TupleTableSlot **as_asyncresults; /* unreturned results of async plans */
+ int as_nasyncresults; /* # of valid entries in as_asyncresults */
+ int as_nasyncremain; /* # of remaining async plans */
+ Bitmapset *as_needrequest; /* async plans ready for a request */
+ struct WaitEventSet *as_eventset; /* WaitEventSet used to configure
+ * file descriptor wait events */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_asyncplans;
bool (*choose_next_subplan) (AppendState *);
};
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ec93e648c..e76db3eb4c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -141,6 +141,8 @@ typedef struct PlannerGlobal
char maxParallelHazard; /* worst PROPARALLEL hazard level */
PartitionDirectory partition_directory; /* partition descriptors */
+
+ bool asyncPlan; /* does plan have async-aware Append? */
} PlannerGlobal;
/* macro for fetching the Plan associated with a SubPlan node */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 43160439f0..c636b498ef 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -59,6 +59,8 @@ typedef struct PlannedStmt
bool parallelModeNeeded; /* parallel mode required to execute? */
+ bool asyncPlan; /* does plan have async-aware Append? */
+
int jitFlags; /* which forms of JIT should be performed */
struct Plan *planTree; /* tree of Plan nodes */
@@ -129,6 +131,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -245,6 +252,7 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
+ int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ed2e4af4be..c2952e375d 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -65,6 +65,7 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
+extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..d9588da38a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -957,6 +957,7 @@ typedef enum
*/
typedef enum
{
+ WAIT_EVENT_APPEND_READY,
WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE = PG_WAIT_IPC,
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index dc7ab2ce8b..e78ca7bddb 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -87,6 +87,7 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -136,6 +137,7 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Plan> +
<Node-Type>Seq Scan</Node-Type> +
<Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
<Relation-Name>int8_tbl</Relation-Name> +
<Alias>i8</Alias> +
<Startup-Cost>N.N</Startup-Cost> +
@@ -183,6 +185,7 @@ select explain_filter('explain (analyze, buffers, format yaml) select * from int
- Plan: +
Node Type: "Seq Scan" +
Parallel Aware: false +
+ Async Capable: false +
Relation Name: "int8_tbl"+
Alias: "i8" +
Startup Cost: N.N +
@@ -233,6 +236,7 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -348,6 +352,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Relation Name": "tenk1", +
"Parallel Aware": true, +
"Local Hit Blocks": 0, +
@@ -393,6 +398,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Sort Space Used": 0, +
"Local Hit Blocks": 0, +
@@ -435,6 +441,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Workers Planned": 0, +
"Local Hit Blocks": 0, +
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index d574583844..406fb88130 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -558,6 +558,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 55, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
@@ -745,6 +746,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 70, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index ff157ceb1c..499245068a 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -204,6 +204,7 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "insertconflicttest", +
"Alias": "insertconflicttest", +
"Conflict Resolution": "UPDATE", +
@@ -213,7 +214,8 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
{ +
"Node Type": "Result", +
"Parent Relationship": "Member", +
- "Parallel Aware": false +
+ "Parallel Aware": false, +
+ "Async Capable": false +
} +
] +
} +
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 81bdacf59d..b7818c0637 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -88,6 +88,7 @@ select count(*) = 1 as ok from pg_stat_wal;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
@@ -106,7 +107,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
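The AsyncRequest bookkeeping introduced above (callback_pending, request_complete, as_nasyncremain, and the result consumed by the Append node) can be illustrated with a small stand-alone sketch. Everything below (ToyAsyncRequest, toy_async_notify, toy_exec_append_async) is a hypothetical simplification for illustration, not PostgreSQL code; it only models the request/notify/consume cycle, with a counter standing in for the remote server:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy stand-ins for the patch's AsyncRequest bookkeeping.  These names
 * are hypothetical simplifications, not PostgreSQL code.
 */
typedef struct ToyAsyncRequest
{
	int		request_index;		/* which subplan this request belongs to */
	bool	callback_pending;	/* result produced but not yet consumed */
	bool	request_complete;	/* subplan has no more tuples */
	int		remaining;			/* tuples the fake "remote" still holds */
} ToyAsyncRequest;

/* Deliver one event: the "remote" yields a tuple or reports EOF. */
static void
toy_async_notify(ToyAsyncRequest *areq)
{
	if (areq->remaining > 0)
	{
		areq->remaining--;
		areq->callback_pending = true;	/* result ready for the Append */
	}
	else
		areq->request_complete = true;	/* no more tuples */
}

/*
 * Drain all async subplans roughly the way the async Append path does:
 * keep issuing requests until every subplan reports completion,
 * consuming one result per notify.  Returns total tuples seen.
 */
static int
toy_exec_append_async(ToyAsyncRequest *reqs, int nreqs)
{
	int		ntuples = 0;
	int		nremain = nreqs;	/* counterpart of as_nasyncremain */

	while (nremain > 0)
	{
		for (int i = 0; i < nreqs; i++)
		{
			if (reqs[i].request_complete)
				continue;
			toy_async_notify(&reqs[i]);
			if (reqs[i].callback_pending)
			{
				reqs[i].callback_pending = false;	/* consume the tuple */
				ntuples++;
			}
			else if (reqs[i].request_complete)
				nremain--;
		}
	}
	return ntuples;
}
```

The real implementation interleaves this with synchronous subplans and a WaitEventSet instead of polling in a tight loop, but the lifecycle of each request (issue, notify, consume or mark complete, decrement the remaining count) follows the same shape.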
On Wed, Feb 10, 2021 at 7:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Attached is an updated version of the patch. Sorry for the delay.
I noticed that I forgot to add new files. :-(. Please find attached
an updated patch.
Best regards,
Etsuro Fujita
Attachments:
async-wip-2021-02-10-v2.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ee0b4acf0b..3ecb8e1e4f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -62,6 +62,7 @@ typedef struct ConnCacheEntry
Oid serverid; /* foreign server OID used to get server name */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ PgFdwConnState state; /* extra per-connection state */
} ConnCacheEntry;
/*
@@ -117,7 +118,7 @@ static bool disconnect_cached_connections(Oid serverid);
* (not even on error), we need this flag to cue manual cleanup.
*/
PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
{
bool found;
bool retry = false;
@@ -264,6 +265,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Remember if caller will prepare statements */
entry->have_prep_stmt |= will_prep_stmt;
+ /* If caller needs access to the per-connection state, return it. */
+ if (state)
+ *state = &entry->state;
+
return entry->conn;
}
@@ -291,6 +296,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ memset(&entry->state, 0, sizeof(entry->state));
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 60c7e115d6..05428ee018 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -7021,7 +7021,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7049,7 +7049,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7077,7 +7077,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7105,7 +7105,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7133,7 +7133,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+----
(0 rows)
@@ -7175,23 +7175,28 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
@@ -7201,9 +7206,9 @@ select * from bar where f1 in (select f1 from foo) for update;
-> Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7213,23 +7218,28 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
@@ -7239,9 +7249,9 @@ select * from bar where f1 in (select f1 from foo) for share;
-> Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 368997d9d1..11b19ae1ef 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -37,9 +38,11 @@
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
#include "postgres_fdw.h"
+#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/float.h"
#include "utils/guc.h"
+#include "utils/hsearch.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -159,6 +162,11 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ /* for asynchronous execution */
+ Oid umid; /* Oid of user mapping */
+ PgFdwConnState *conn_state; /* extra per-connection state */
+ ForeignScanState *next_node; /* next ForeignScan node to activate */
+
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -408,6 +416,12 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresReconsiderAsyncForeignScan(ForeignScanState *node,
+ AsyncContext *acxt);
+static void postgresForeignAsyncRequest(AsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(AsyncRequest *areq);
+static void postgresForeignAsyncNotify(AsyncRequest *areq);
/*
* Helper functions
@@ -435,7 +449,11 @@ static void adjust_foreign_grouping_path_cost(PlannerInfo *root,
static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
+static UserMapping *get_user_mapping(EState *estate, ForeignScan *fsplan);
+static void record_foreign_scan_info(EState *estate, ForeignScanState *node,
+ UserMapping *user);
static void create_cursor(ForeignScanState *node);
+static void fetch_more_data_begin(ForeignScanState *node);
static void fetch_more_data(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static PgFdwModifyState *create_foreign_modify(EState *estate,
@@ -491,6 +509,7 @@ static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
double *totaldeadrows);
static void analyze_row_processor(PGresult *res, int row,
PgFdwAnalyzeState *astate);
+static void request_tuple_asynchronously(AsyncRequest *areq);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -583,6 +602,13 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for asynchronous execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ReconsiderAsyncForeignScan = postgresReconsiderAsyncForeignScan;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1417,19 +1443,40 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
{
ForeignScan *fsplan = (ForeignScan *) node->ss.ps.plan;
EState *estate = node->ss.ps.state;
+ bool asyncPlan = estate->es_plannedstmt->asyncPlan;
PgFdwScanState *fsstate;
- RangeTblEntry *rte;
- Oid userid;
- ForeignTable *table;
UserMapping *user;
- int rtindex;
int numParams;
/*
- * Do nothing in EXPLAIN (no ANALYZE) case. node->fdw_state stays NULL.
+ * No need to work hard in EXPLAIN (no ANALYZE) case. In that case,
+ * either node->fdw_state stays NULL or node->fdw_state->conn stays NULL.
*/
if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
+ {
+ /* Do nothing if the query plan tree has no async-aware Appends. */
+ if (!asyncPlan)
+ return;
+
+ /* Get info about user mapping. */
+ user = get_user_mapping(estate, fsplan);
+
+ /* Record the information on the ForeignScan node in the EState. */
+ record_foreign_scan_info(estate, node, user);
+
+ /*
+ * If the ForeignScan node is async-capable, save the user mapping
+ * OID in node->fdw_state for use later.
+ */
+ if (node->ss.ps.async_capable)
+ {
+ fsstate = (PgFdwScanState *) palloc0(sizeof(PgFdwScanState));
+ node->fdw_state = (void *) fsstate;
+ fsstate->umid = user->umid;
+ }
+
return;
+ }
/*
* We'll save private state in node->fdw_state.
@@ -1437,28 +1484,27 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
fsstate = (PgFdwScanState *) palloc0(sizeof(PgFdwScanState));
node->fdw_state = (void *) fsstate;
- /*
- * Identify which user to do the remote access as. This should match what
- * ExecCheckRTEPerms() does. In case of a join or aggregate, use the
- * lowest-numbered member RTE as a representative; we would get the same
- * result from any.
- */
- if (fsplan->scan.scanrelid > 0)
- rtindex = fsplan->scan.scanrelid;
- else
- rtindex = bms_next_member(fsplan->fs_relids, -1);
- rte = exec_rt_fetch(rtindex, estate);
- userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
+ /* Get info about user mapping. */
+ user = get_user_mapping(estate, fsplan);
- /* Get info about foreign table. */
- table = GetForeignTable(rte->relid);
- user = GetUserMapping(userid, table->serverid);
+ if (asyncPlan)
+ {
+ /* Record the information on the ForeignScan node in the EState. */
+ record_foreign_scan_info(estate, node, user);
+
+ /*
+ * If the ForeignScan node is async-capable, save the user mapping
+ * OID in node->fdw_state for use later.
+ */
+ if (node->ss.ps.async_capable)
+ fsstate->umid = user->umid;
+ }
/*
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
/* Assign a unique ID for my cursor */
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1509,6 +1555,11 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
+
+ /* Initialize async state */
+ fsstate->conn_state->activated = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
}
/*
@@ -1523,8 +1574,10 @@ postgresIterateForeignScan(ForeignScanState *node)
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
+ * In sync mode, if this is the first call after Begin or ReScan, we need
+ * to create the cursor on the remote side. In async mode, we would have
+ * already created the cursor before we get here, even if this is the first
+ * call after Begin or ReScan.
*/
if (!fsstate->cursor_exists)
create_cursor(node);
@@ -1534,6 +1587,9 @@ postgresIterateForeignScan(ForeignScanState *node)
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
+ /* In async mode, just clear tuple slot. */
+ if (node->ss.ps.async_capable)
+ return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
fetch_more_data(node);
@@ -1563,6 +1619,14 @@ postgresReScanForeignScan(ForeignScanState *node)
char sql[64];
PGresult *res;
+ /* Reset async state */
+ if (node->ss.ps.async_capable)
+ {
+ fsstate->conn_state->activated = NULL;
+ fsstate->conn_state->async_query_sent = false;
+ fsstate->next_node = NULL;
+ }
+
/* If we haven't created the cursor yet, nothing to do. */
if (!fsstate->cursor_exists)
return;
@@ -1617,10 +1681,21 @@ postgresEndForeignScan(ForeignScanState *node)
{
PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
- /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
- if (fsstate == NULL)
+ /*
+ * If fsstate is NULL or fsstate->conn is NULL, we are in EXPLAIN;
+ * nothing to do.
+ */
+ if (fsstate == NULL || fsstate->conn == NULL)
return;
+ /*
+ * If we're ending before we've collected a response from an asynchronous
+ * query, we have to consume the response.
+ */
+ if (fsstate->conn_state->activated == node &&
+ fsstate->conn_state->async_query_sent)
+ fetch_more_data(node);
+
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
close_cursor(fsstate->conn, fsstate->cursor_number);
@@ -2491,7 +2566,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->conn = GetConnection(user, false, NULL);
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2872,7 +2947,7 @@ estimate_path_cost_size(PlannerInfo *root,
false, &retrieved_attrs, NULL);
/* Get the remote estimate */
- conn = GetConnection(fpinfo->user, false);
+ conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3428,6 +3503,53 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
return true;
}
+static UserMapping *
+get_user_mapping(EState *estate, ForeignScan *fsplan)
+{
+ int rtindex;
+ RangeTblEntry *rte;
+ Oid userid;
+ ForeignTable *table;
+
+ /*
+ * Identify which user to do the remote access as. This should match what
+ * ExecCheckRTEPerms() does. In case of a join or aggregate, use the
+ * lowest-numbered member RTE as a representative; we would get the same
+ * result from any.
+ */
+ if (fsplan->scan.scanrelid > 0)
+ rtindex = fsplan->scan.scanrelid;
+ else
+ rtindex = bms_next_member(fsplan->fs_relids, -1);
+ rte = exec_rt_fetch(rtindex, estate);
+ userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
+
+ /* Get info about foreign table. */
+ table = GetForeignTable(rte->relid);
+
+ return GetUserMapping(userid, table->serverid);
+}
+
+static void
+record_foreign_scan_info(EState *estate, ForeignScanState *node,
+ UserMapping *user)
+{
+ HTAB *htab = estate->es_foreign_scan_hash;
+ int fsplanid = node->ss.ps.plan->plan_node_id;
+ bool found;
+ ForeignScanHashEntry *entry;
+
+ /* Find or create hash table entry for the user mapping. */
+ Assert(htab);
+ entry = (ForeignScanHashEntry *) hash_search(htab, &user->umid,
+ HASH_ENTER, &found);
+
+ if (!found)
+ entry->fsplanids = bms_make_singleton(fsplanid);
+ else
+ entry->fsplanids = bms_add_member(entry->fsplanids, fsplanid);
+}
+
/*
* Create cursor for node's query with current parameter values.
*/
@@ -3500,6 +3622,34 @@ create_cursor(ForeignScanState *node)
pfree(buf.data);
}
+/*
+ * Begin an asynchronous data fetch.
+ * fetch_more_data must be called later to collect the results.
+ */
+static void
+fetch_more_data_begin(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PGconn *conn = fsstate->conn;
+ char sql[64];
+
+ Assert(fsstate->conn_state->activated == node);
+ Assert(!fsstate->conn_state->async_query_sent);
+
+ /* Create the cursor synchronously. */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ /* Construct the FETCH and send it, without waiting for the result. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, fsstate->query);
+
+ fsstate->conn_state->async_query_sent = true;
+}
+
/*
* Fetch some more rows from the node's cursor.
*/
@@ -3522,17 +3672,36 @@ fetch_more_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->conn;
- char sql[64];
int numrows;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
+ if (node->ss.ps.async_capable)
+ {
+ Assert(fsstate->conn_state->activated == node);
+ Assert(fsstate->conn_state->async_query_sent);
+
+ /*
+ * The query was already sent by an earlier call to
+ * fetch_more_data_begin. So now we just fetch the result.
+ */
+ res = PQgetResult(conn);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
+ else
+ {
+ char sql[64];
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
- pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ /* This is a regular synchronous fetch. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ res = pgfdw_exec_query(conn, sql);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
/* Convert the data into HeapTuples */
numrows = PQntuples(res);
@@ -3559,6 +3728,15 @@ fetch_more_data(ForeignScanState *node)
/* Must be EOF if we didn't get as many tuples as we asked for. */
fsstate->eof_reached = (numrows < fsstate->fetch_size);
+
+ /*
+ * If this was the second part of an async request, consume results
+ * until PQgetResult returns NULL to complete the request.
+ */
+ if (node->ss.ps.async_capable)
+ {
+ while (PQgetResult(conn) != NULL)
+ ;
+ fsstate->conn_state->async_query_sent = false;
+ }
}
PG_FINALLY();
{
@@ -3684,7 +3862,7 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->conn = GetConnection(user, true, NULL);
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -4618,7 +4796,7 @@ postgresAnalyzeForeignTable(Relation relation,
*/
table = GetForeignTable(RelationGetRelid(relation));
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct command to get page count for relation.
@@ -4704,7 +4882,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
table = GetForeignTable(RelationGetRelid(relation));
server = GetForeignServer(table->serverid);
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct cursor that retrieves whole rows from remote.
@@ -4932,7 +5110,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
*/
server = GetForeignServer(serverOid);
mapping = GetUserMapping(GetUserId(), server->serverid);
- conn = GetConnection(mapping, false);
+ conn = GetConnection(mapping, false, NULL);
/* Don't attempt to import collation if remote server hasn't got it */
if (PQserverVersion(conn) < 90100)
@@ -6479,6 +6657,221 @@ add_foreign_final_paths(PlannerInfo *root, RelOptInfo *input_rel,
add_path(final_rel, (Path *) final_path);
}
+/*
+ * postgresIsForeignPathAsyncCapable
+ * Check whether a given ForeignPath node is async-capable.
+ */
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * postgresReconsiderAsyncForeignScan
+ * Re-examine a given ForeignScan node that was planned as async-capable.
+ */
+static bool
+postgresReconsiderAsyncForeignScan(ForeignScanState *node, AsyncContext *acxt)
+{
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ EState *estate = node->ss.ps.state;
+ HTAB *htab = estate->es_foreign_scan_hash;
+ bool found;
+ ForeignScanHashEntry *entry;
+ AppendState *requestor = (AppendState *) acxt->requestor;
+ Bitmapset *asyncplanids = requestor->as_asyncplanids;
+ Bitmapset *fsplanids;
+
+ /* Find hash table entry for the ForeignScan node. */
+ Assert(htab);
+ entry = (ForeignScanHashEntry *) hash_search(htab, &fsstate->umid,
+ HASH_FIND, &found);
+ Assert(found);
+
+ fsplanids = entry->fsplanids;
+ Assert(bms_is_member(node->ss.ps.plan->plan_node_id, fsplanids));
+
+ /*
+ * If the connection used for the ForeignScan node is used in other parts
+ * of the query plan tree except async subplans of the parent Append node,
+ * disable async execution of the ForeignScan node.
+ */
+ if (!bms_is_subset(fsplanids, asyncplanids))
+ return false;
+
+ /*
+ * If all subplans of the Append node are async-capable and use the same
+ * connection, async execution would provide no benefit, so don't execute
+ * them asynchronously.
+ */
+ if (requestor->as_nasyncplans == requestor->as_nplans &&
+ !bms_nonempty_difference(asyncplanids, fsplanids))
+ return false;
+
+ return true;
+}
+
+/*
+ * postgresForeignAsyncRequest
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+
+ /*
+ * If this is the first call after Begin or ReScan, mark the connection
+ * as used by the ForeignScan node.
+ */
+ if (fsstate->conn_state->activated == NULL)
+ fsstate->conn_state->activated = node;
+
+ /*
+ * If the connection has already been used by a ForeignScan node, put it
+ * at the end of the chain of waiting ForeignScan nodes, and then return.
+ */
+ if (node != fsstate->conn_state->activated)
+ {
+ ForeignScanState *curr_node = fsstate->conn_state->activated;
+ PgFdwScanState *curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+
+ /* Scan down the chain ... */
+ while (curr_fsstate->next_node)
+ {
+ curr_node = curr_fsstate->next_node;
+ Assert(node != curr_node);
+ curr_fsstate = (PgFdwScanState *) curr_node->fdw_state;
+ }
+ /* Update the chain linking */
+ curr_fsstate->next_node = node;
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+
+ request_tuple_asynchronously(areq);
+}
+
+/*
+ * postgresForeignAsyncConfigureWait
+ * Configure a file descriptor event for which we wish to wait.
+ */
+static void
+postgresForeignAsyncConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AppendState *requestor = (AppendState *) areq->requestor;
+ WaitEventSet *set = requestor->as_eventset;
+
+ /* This function should not be called unless callback_pending */
+ Assert(areq->callback_pending);
+
+ /* If the ForeignScan node isn't activated yet, nothing to do */
+ if (fsstate->conn_state->activated != node)
+ return;
+
+ AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
+ NULL, areq);
+}
+
+/*
+ * postgresForeignAsyncNotify
+ * Fetch some more tuples from a file descriptor that has become ready,
+ * then request the next tuple.
+ */
+static void
+postgresForeignAsyncNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+
+ /* The core code should have cleared the callback_pending flag */
+ Assert(!areq->callback_pending);
+
+ fetch_more_data(node);
+
+ request_tuple_asynchronously(areq);
+}
+
+/*
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+request_tuple_asynchronously(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ TupleTableSlot *result;
+
+ /* Request some more tuples, if we've run out */
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ /* No point in another fetch if we already detected EOF, though */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+ fsstate->conn_state->activated = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->activated = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin an asynchronous fetch for that node */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ return;
+ }
+
+ /* Get a tuple from the ForeignScan node */
+ result = ExecProcNode((PlanState *) node);
+
+ if (TupIsNull(result))
+ {
+ Assert(fsstate->next_tuple >= fsstate->num_tuples);
+
+ /* Request some more tuples, if we've not detected EOF yet */
+ if (!fsstate->eof_reached)
+ {
+ /* Begin another fetch */
+ fetch_more_data_begin(node);
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ return;
+ }
+ fsstate->conn_state->activated = NULL;
+
+ /* Activate the next ForeignScan node if any */
+ if (fsstate->next_node)
+ {
+ /* Mark the connection as used by the next ForeignScan node */
+ fsstate->conn_state->activated = fsstate->next_node;
+ Assert(!fsstate->conn_state->async_query_sent);
+ /* Begin an asynchronous fetch for that node */
+ fetch_more_data_begin(fsstate->next_node);
+ }
+ }
+
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 1f67b4d9fd..c3537b6449 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,6 +16,7 @@
#include "foreign/foreign.h"
#include "lib/stringinfo.h"
#include "libpq-fe.h"
+#include "nodes/execnodes.h"
#include "nodes/pathnodes.h"
#include "utils/relcache.h"
@@ -124,12 +125,22 @@ typedef struct PgFdwRelationInfo
int relation_index;
} PgFdwRelationInfo;
+/*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+ ForeignScanState *activated; /* currently-activated ForeignScan node */
+ bool async_query_sent; /* has an asynchronous query been sent? */
+} PgFdwConnState;
+
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+ PgFdwConnState **state);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 151f4f1834..ceda16b92f 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1822,31 +1822,31 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1882,12 +1882,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5ef1c7ad3c..4a9eece710 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4735,6 +4735,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<variablelist>
+ <varlistentry id="guc-enable-async-append" xreflabel="enable_async_append">
+ <term><varname>enable_async_append</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_async_append</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of async-aware
+ append plan types. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
<term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c602ee4427..a2d2f42e28 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1563,6 +1563,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AppendReady</literal></entry>
+ <entry>Waiting for a subplan of Append to be ready.</entry>
+ </row>
<row>
<entry><literal>BackupWaitWalArchive</literal></entry>
<entry>Waiting for WAL files required for a backup to be successfully
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f80e379973..a2b7b8bd67 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1390,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (planstate->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1409,6 +1411,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (custom_name)
ExplainPropertyText("Custom Plan Provider", custom_name, es);
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
+ ExplainPropertyBool("Async Capable", planstate->async_capable, es);
}
switch (nodeTag(plan))
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 23bdb53cd1..613835b748 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -526,6 +526,10 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e69de29bb2..4e87ae6489 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+
+static void ExecAsyncResponse(AsyncRequest *areq);
+
+/*
+ * Re-examine a plan node that was considered async-capable at plan time.
+ */
+bool
+ExecReconsiderAsyncCapablePlan(PlanState *node, AsyncContext *acxt)
+{
+ bool result;
+
+ switch (nodeTag(node))
+ {
+ case T_ForeignScanState:
+ result = ExecReconsiderAsyncForeignScan((ForeignScanState *) node,
+ acxt);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(node));
+ result = false; /* keep compiler quiet */
+ break;
+ }
+
+ return result;
+}
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ */
+void
+ExecAsyncRequest(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor event
+ * for which it wishes to wait. We expect the node-type specific callback to
+ * make a single call of the following form:
+ *
+ * AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, areq);
+ */
+void
+ExecAsyncConfigureWait(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+void
+ExecAsyncNotify(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * A requestee node should call this function to deliver the tuple to its
+ * requestor node. The node can call this from its ExecAsyncRequest callback
+ * if the requested tuple is available immediately.
+ */
+void
+ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result)
+{
+ areq->request_complete = true;
+ areq->result = result;
+}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c74ce36ffb..caf11dc4e0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -48,6 +48,7 @@
#include "commands/matview.h"
#include "commands/trigger.h"
#include "executor/execdebug.h"
+#include "executor/nodeAppend.h"
#include "executor/nodeSubplan.h"
#include "foreign/fdwapi.h"
#include "jit/jit.h"
@@ -78,6 +79,8 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecBuildForeignScanHashTable(EState *estate);
+static void ExecReconsiderPlan(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(EState *estate, PlanState *planstate,
@@ -886,6 +889,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
/* signal that this EState is not used for EPQ */
estate->es_epq_active = NULL;
+ if (plannedstmt->asyncPlan)
+ ExecBuildForeignScanHashTable(estate);
+
/*
* Initialize private state information for each SubPlan. We must do this
* before running ExecInitNode on the main query tree, since
@@ -924,6 +930,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
*/
planstate = ExecInitNode(plan, estate, eflags);
+ if (plannedstmt->asyncPlan)
+ ExecReconsiderPlan(estate);
+
/*
* Get the tuple descriptor describing the type of tuples to return.
*/
@@ -1321,6 +1330,35 @@ ExecGetTriggerResultRel(EState *estate, Oid relid)
return rInfo;
}
+static void
+ExecBuildForeignScanHashTable(EState *estate)
+{
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(ForeignScanHashEntry);
+ ctl.hcxt = CurrentMemoryContext;
+
+ estate->es_foreign_scan_hash =
+ hash_create("User mapping dependency table", 256,
+ &ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+}
+
+static void
+ExecReconsiderPlan(EState *estate)
+{
+ ListCell *lc;
+
+ foreach(lc, estate->es_asyncappends)
+ {
+ AppendState *appendstate = (AppendState *) lfirst(lc);
+
+ ExecReconsiderAsyncAppend(appendstate);
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index c734283bfe..df7b9b591b 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -156,6 +156,9 @@ CreateExecutorState(void)
estate->es_use_parallel_mode = false;
+ estate->es_asyncappends = NIL;
+ estate->es_foreign_scan_hash = NULL;
+
estate->es_jit_flags = 0;
estate->es_jit = NULL;
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 15e4115bd6..3896d9fcd4 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -57,10 +57,13 @@
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
/* Shared state for parallel-aware Append. */
struct ParallelAppendState
@@ -78,12 +81,18 @@ struct ParallelAppendState
};
#define INVALID_SUBPLAN_INDEX -1
+#define EVENT_BUFFER_SIZE 16
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
static void mark_invalid_subplans_as_finished(AppendState *node);
+static void ExecAppendAsyncBegin(AppendState *node);
+static bool ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result);
+static void ExecAppendAsyncEventWait(AppendState *node);
+static void classify_matching_subplans(AppendState *node);
/* ----------------------------------------------------------------
* ExecInitAppend
@@ -102,7 +111,10 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
AppendState *appendstate = makeNode(AppendState);
PlanState **appendplanstates;
Bitmapset *validsubplans;
+ Bitmapset *asyncplans;
+ Bitmapset *asyncplanids;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
@@ -119,6 +131,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_syncdone = false;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -191,12 +204,27 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* While at it, find out the first valid partial plan.
*/
j = 0;
+ asyncplans = NULL;
+ asyncplanids = NULL;
+ nasyncplans = 0;
firstvalid = nplans;
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record async subplans. When executing EvalPlanQual, we process
+ * async subplans synchronously, so don't do this in that case.
+ */
+ if (initNode->async_capable && estate->es_epq_active == NULL)
+ {
+ asyncplans = bms_add_member(asyncplans, j);
+ asyncplanids = bms_add_member(asyncplanids,
+ initNode->plan_node_id);
+ nasyncplans++;
+ }
+
/*
* Record the lowest appendplans index which is a valid partial plan.
*/
@@ -210,6 +238,11 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* Initialize async state */
+ appendstate->as_asyncplans = asyncplans;
+ appendstate->as_asyncplanids = asyncplanids;
+ appendstate->as_nasyncplans = nasyncplans;
+
/*
* Miscellaneous initialization
*/
@@ -219,6 +252,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* For parallel query, this will be overridden later. */
appendstate->choose_next_subplan = choose_next_subplan_locally;
+ /*
+ * Lastly, if there is at least one async subplan, add the Append node to
+ * estate->es_asyncappends so that we can re-examine it in
+ * ExecReconsiderPlan.
+ */
+ if (nasyncplans > 0)
+ estate->es_asyncappends = lappend(estate->es_asyncappends,
+ appendstate);
+
return appendstate;
}
@@ -232,31 +274,45 @@ static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
+ TupleTableSlot *result;
- if (node->as_whichplan < 0)
+ if (!node->as_syncdone && node->as_whichplan == INVALID_SUBPLAN_INDEX)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ /* If there are any async subplans, begin execution of them */
+ if (node->as_nasyncplans > 0)
+ ExecAppendAsyncBegin(node);
+
/*
- * If no subplan has been chosen, we must choose one before
+ * If no sync subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
- !node->choose_next_subplan(node))
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
{
PlanState *subnode;
- TupleTableSlot *result;
CHECK_FOR_INTERRUPTS();
/*
- * figure out which subplan we are currently processing
+ * try to get a tuple from any of the async subplans
+ */
+ if (!bms_is_empty(node->as_needrequest) ||
+ (node->as_syncdone && node->as_nasyncremain > 0))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
+ Assert(bms_is_empty(node->as_needrequest));
+ }
+
+ /*
+ * figure out which sync subplan we are currently processing
*/
Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
subnode = node->appendplans[node->as_whichplan];
@@ -276,8 +332,16 @@ ExecAppend(PlanState *pstate)
return result;
}
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /* wait or poll async events */
+ if (node->as_nasyncremain > 0)
+ {
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ ExecAppendAsyncEventWait(node);
+ }
+
+ /* choose new sync subplan; if no sync/async subplans, we're done */
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -313,6 +377,7 @@ ExecEndAppend(AppendState *node)
void
ExecReScanAppend(AppendState *node)
{
+ int nasyncplans = node->as_nasyncplans;
int i;
/*
@@ -326,6 +391,11 @@ ExecReScanAppend(AppendState *node)
{
bms_free(node->as_valid_subplans);
node->as_valid_subplans = NULL;
+ if (nasyncplans > 0)
+ {
+ bms_free(node->as_valid_asyncplans);
+ node->as_valid_asyncplans = NULL;
+ }
}
for (i = 0; i < node->as_nplans; i++)
@@ -347,8 +417,26 @@ ExecReScanAppend(AppendState *node)
ExecReScan(subnode);
}
+ /* Reset async state */
+ if (nasyncplans > 0)
+ {
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+ }
+
+ bms_free(node->as_needrequest);
+ node->as_needrequest = NULL;
+ }
+
/* Let choose_next_subplan_* function handle setting the first subplan */
node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_syncdone = false;
}
/* ----------------------------------------------------------------
@@ -429,7 +517,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
/* ----------------------------------------------------------------
* choose_next_subplan_locally
*
- * Choose next subplan for a non-parallel-aware Append,
+ * Choose next sync subplan for a non-parallel-aware Append,
* returning false if there are no more.
* ----------------------------------------------------------------
*/
@@ -444,9 +532,9 @@ choose_next_subplan_locally(AppendState *node)
/*
* If first call then have the bms member function choose the first valid
- * subplan by initializing whichplan to -1. If there happen to be no
- * valid subplans then the bms member function will handle that by
- * returning a negative number which will allow us to exit returning a
+ * sync subplan by initializing whichplan to -1. If there happen to be
+ * no valid sync subplans then the bms member function will handle that
+ * by returning a negative number which will allow us to exit returning a
* false value.
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
@@ -467,7 +555,10 @@ choose_next_subplan_locally(AppendState *node)
nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
if (nextplan < 0)
+ {
+ node->as_syncdone = true;
return false;
+ }
node->as_whichplan = nextplan;
@@ -709,3 +800,362 @@ mark_invalid_subplans_as_finished(AppendState *node)
node->as_pstate->pa_finished[i] = true;
}
}
+
+/* ----------------------------------------------------------------
+ * ExecReconsiderAsyncAppend
+ *
+ * Re-examine an async-aware Append node
+ * ----------------------------------------------------------------
+ */
+void
+ExecReconsiderAsyncAppend(AppendState *node)
+{
+ Bitmapset *asyncplans = bms_copy(node->as_asyncplans);
+ int nasyncplans;
+ AsyncRequest **asyncrequests;
+ AsyncContext acxt;
+ int i;
+
+ asyncrequests = (AsyncRequest **) palloc0(node->as_nplans *
+ sizeof(AsyncRequest *));
+
+ /* Re-examine each async subplan */
+ acxt.requestor = (PlanState *) node;
+ i = -1;
+ while ((i = bms_next_member(asyncplans, i)) >= 0)
+ {
+ PlanState *subnode = node->appendplans[i];
+
+ acxt.request_index = i;
+ if (!ExecReconsiderAsyncCapablePlan(subnode, &acxt))
+ {
+ node->as_asyncplans = bms_del_member(node->as_asyncplans, i);
+ node->as_asyncplanids = bms_del_member(node->as_asyncplanids,
+ subnode->plan->plan_node_id);
+ --node->as_nasyncplans;
+ }
+ else
+ {
+ AsyncRequest *areq;
+
+ areq = palloc(sizeof(AsyncRequest));
+ areq->requestor = (PlanState *) node;
+ areq->requestee = subnode;
+ areq->request_index = i;
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+
+ asyncrequests[i] = areq;
+ }
+ }
+ bms_free(asyncplans);
+
+ /* No need for further processing if there are no async subplans */
+ nasyncplans = node->as_nasyncplans;
+ if (nasyncplans == 0)
+ return;
+
+ /* Initialize remaining async state */
+ node->as_asyncrequests = asyncrequests;
+ node->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ node->as_needrequest = NULL;
+
+ classify_matching_subplans(node);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncBegin
+ *
+ * Begin execution of designated async-capable subplans.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncBegin(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+ int i;
+
+ /* We should never be called when there are no async subplans. */
+ Assert(node->as_nasyncplans > 0);
+
+ if (node->as_valid_subplans == NULL)
+ {
+ Assert(node->as_valid_asyncplans == NULL);
+
+ node->as_valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ classify_matching_subplans(node);
+ }
+
+ node->as_nasyncremain = 0;
+
+ /* Nothing to do if there are no valid async subplans. */
+ valid_asyncplans = node->as_valid_asyncplans;
+ if (valid_asyncplans == NULL)
+ return;
+
+ /* Make a request for each of the async subplans. */
+ i = -1;
+ while ((i = bms_next_member(valid_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(areq->request_index == i);
+ Assert(!areq->callback_pending);
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+
+ ++node->as_nasyncremain;
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncGetNext
+ *
+ * Get the next tuple from any of the asynchronous subplans.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
+{
+ *result = NULL;
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ while (node->as_nasyncremain > 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Wait or poll async events. */
+ ExecAppendAsyncEventWait(node);
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ /* Break from loop if there is any sync node that is not complete */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If all sync subplans are complete, we're totally done scanning the
+ * given node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the sync subplans.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncremain == 0);
+ *result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncRequest
+ *
+ * If there are any asynchronous subplans that need a new asynchronous
+ * request, make all of them.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
+{
+ Bitmapset *needrequest;
+ int i;
+
+ /* Nothing to do if there are no async subplans needing a new request. */
+ if (bms_is_empty(node->as_needrequest))
+ return false;
+
+ /*
+ * If there are any asynchronously-generated results that have not yet
+ * been returned, we have nothing to do; just return one of them.
+ */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ /* Make a new request for each of the async subplans that need it. */
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ i = -1;
+ while ((i = bms_next_member(needrequest, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(needrequest);
+
+ /* Return one of the asynchronously-generated results if any. */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncEventWait
+ *
+ * Wait or poll for file descriptor wait events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncEventWait(AppendState *node)
+{
+ long timeout = node->as_syncdone ? -1 : 0;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+
+ /* Nothing to do if there are no remaining async subplans. */
+ if (node->as_nasyncremain == 0)
+ return;
+
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
+ node->as_nasyncplans + 1);
+ AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+ NULL, NULL);
+
+ /* Give each waiting subplan a chance to add an event. */
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ if (areq->callback_pending)
+ ExecAsyncConfigureWait(areq);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+ EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ FreeWaitEventSet(node->as_eventset);
+ node->as_eventset = NULL;
+ if (noccurred == 0)
+ return;
+
+ /* Deliver notifications. */
+ for (i = 0; i < noccurred; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ /*
+ * Each waiting subplan should have registered its wait event with
+ * user_data pointing back to its AsyncRequest.
+ */
+ if ((w->events & WL_SOCKET_READABLE) != 0)
+ {
+ AsyncRequest *areq = (AsyncRequest *) w->user_data;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ Assert(areq->callback_pending);
+ areq->callback_pending = false;
+
+ /* Do the actual work. */
+ ExecAsyncNotify(areq);
+ }
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(AsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot = areq->result;
+
+ /* The result should be a TupleTableSlot or NULL. */
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Nothing to do if the request is pending. */
+ if (!areq->request_complete)
+ {
+ /*
+ * The subplan for which the request was made would be pending for a
+ * callback.
+ */
+ Assert(areq->callback_pending);
+ return;
+ }
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ {
+ /* The ending subplan wouldn't have been pending for a callback. */
+ Assert(!areq->callback_pending);
+ --node->as_nasyncremain;
+ return;
+ }
+
+ /* Save result so we can return it */
+ Assert(node->as_nasyncresults < node->as_nasyncplans);
+ node->as_asyncresults[node->as_nasyncresults++] = slot;
+
+ /*
+ * Mark the subplan that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
+
+/* ----------------------------------------------------------------
+ * classify_matching_subplans
+ *
+ * Classify the node's as_valid_subplans into sync ones and
+ * async ones, adjust it to contain sync ones only, and save
+ * async ones in the node's as_valid_asyncplans
+ * ----------------------------------------------------------------
+ */
+static void
+classify_matching_subplans(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+
+ /* Nothing to do if there are no valid subplans. */
+ if (bms_is_empty(node->as_valid_subplans))
+ return;
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (!bms_overlap(node->as_valid_subplans, node->as_asyncplans))
+ return;
+
+ /* Get valid async subplans. */
+ valid_asyncplans = bms_copy(node->as_asyncplans);
+ valid_asyncplans = bms_int_members(valid_asyncplans,
+ node->as_valid_subplans);
+
+ /* Adjust the valid subplans to contain sync subplans only. */
+ node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
+ valid_asyncplans);
+
+ /* Save valid async subplans. */
+ node->as_valid_asyncplans = valid_asyncplans;
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0969e53c3a..c92a35b8a6 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -22,6 +22,7 @@
*/
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/executor.h"
#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
@@ -222,6 +223,9 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
if (node->resultRelation > 0)
scanstate->resultRelInfo = estate->es_result_relations[node->resultRelation - 1];
+ /* Initialize the async_capable flag. */
+ scanstate->ss.ps.async_capable = ((Plan *) node)->async_capable;
+
/* Initialize any outer plan. */
if (outerPlan(node))
outerPlanState(scanstate) =
@@ -391,3 +395,73 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecReconsiderAsyncForeignScan
+ *
+ * Re-examine a ForeignScan node that was considered async-capable
+ * at plan time.
+ * ----------------------------------------------------------------
+ */
+bool
+ExecReconsiderAsyncForeignScan(ForeignScanState *node, AsyncContext *acxt)
+{
+ FdwRoutine *fdwroutine = node->fdwroutine;
+ bool result = true;
+
+ if (fdwroutine->ReconsiderAsyncForeignScan)
+ {
+ result = fdwroutine->ReconsiderAsyncForeignScan(node, acxt);
+ if (!result)
+ node->ss.ps.async_capable = false;
+ }
+ return result;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Asynchronously request a tuple from a designated async-capable node
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Callback invoked when a relevant event has occurred
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 65bbc18ecb..8d64772931 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -87,6 +87,7 @@ _copyPlannedStmt(const PlannedStmt *from)
COPY_SCALAR_FIELD(transientPlan);
COPY_SCALAR_FIELD(dependsOnRole);
COPY_SCALAR_FIELD(parallelModeNeeded);
+ COPY_SCALAR_FIELD(asyncPlan);
COPY_SCALAR_FIELD(jitFlags);
COPY_NODE_FIELD(planTree);
COPY_NODE_FIELD(rtable);
@@ -120,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -241,6 +243,7 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f5dcedf6e8..80a853d706 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -305,6 +305,7 @@ _outPlannedStmt(StringInfo str, const PlannedStmt *node)
WRITE_BOOL_FIELD(transientPlan);
WRITE_BOOL_FIELD(dependsOnRole);
WRITE_BOOL_FIELD(parallelModeNeeded);
+ WRITE_BOOL_FIELD(asyncPlan);
WRITE_INT_FIELD(jitFlags);
WRITE_NODE_FIELD(planTree);
WRITE_NODE_FIELD(rtable);
@@ -333,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -431,6 +433,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
@@ -2221,6 +2224,7 @@ _outPlannerGlobal(StringInfo str, const PlannerGlobal *node)
WRITE_BOOL_FIELD(parallelModeOK);
WRITE_BOOL_FIELD(parallelModeNeeded);
WRITE_CHAR_FIELD(maxParallelHazard);
+ WRITE_BOOL_FIELD(asyncPlan);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 4388aae71d..5104d1a2b4 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1581,6 +1581,7 @@ _readPlannedStmt(void)
READ_BOOL_FIELD(transientPlan);
READ_BOOL_FIELD(dependsOnRole);
READ_BOOL_FIELD(parallelModeNeeded);
+ READ_BOOL_FIELD(asyncPlan);
READ_INT_FIELD(jitFlags);
READ_NODE_FIELD(planTree);
READ_NODE_FIELD(rtable);
@@ -1614,6 +1615,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1710,6 +1712,7 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index aab06c7d21..3b034a0326 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -147,6 +147,7 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
+bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 6c8305c977..b30f8255f2 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,6 +81,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
+static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1066,6 +1067,30 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ break;
+ default:
+ break;
+ }
+ return false;
+}
+
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1083,6 +1108,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1090,6 +1116,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1153,6 +1180,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
+ /* If appropriate, consider async append */
+ consider_async = (enable_async_append && pathkeys == NIL &&
+ !best_path->path.parallel_safe &&
+ list_length(best_path->subpaths) > 1);
+
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1220,6 +1252,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (consider_async && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ ++nasyncplans;
+ }
}
/*
@@ -1252,9 +1291,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
+ plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
+ if (nasyncplans > 0)
+ root->glob->asyncPlan = true;
+
copy_generic_path_info(&plan->plan, (Path *) best_path);
/*
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index adf68d8790..95e7601a31 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -312,6 +312,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
glob->lastPlanNodeId = 0;
glob->transientPlan = false;
glob->dependsOnRole = false;
+ glob->asyncPlan = false;
/*
* Assess whether it's feasible to use parallel mode for this query. We
@@ -513,6 +514,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
result->transientPlan = glob->transientPlan;
result->dependsOnRole = glob->dependsOnRole;
result->parallelModeNeeded = glob->parallelModeNeeded;
+ result->asyncPlan = glob->asyncPlan;
result->planTree = top_plan;
result->rtable = glob->finalrtable;
result->resultRelations = glob->resultRelations;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..58f8e0bbcf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3999,6 +3999,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
switch (w)
{
+ case WAIT_EVENT_APPEND_READY:
+ event_name = "AppendReady";
+ break;
case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
event_name = "BackupWaitWalArchive";
break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index eafdb1118e..507567aff3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1111,6 +1111,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of async append plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_async_append,
+ true,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index bd57e917e1..1306094865 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -370,6 +370,7 @@
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on
+#enable_async_append = on
# - Planner Cost Constants -
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
index e69de29bb2..bce30417d7 100644
--- a/src/include/executor/execAsync.h
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+typedef struct AsyncContext
+{
+ PlanState *requestor;
+ int request_index;
+} AsyncContext;
+
+extern bool ExecReconsiderAsyncCapablePlan(PlanState *node,
+ AsyncContext *acxt);
+extern void ExecAsyncRequest(AsyncRequest *areq);
+extern void ExecAsyncConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncNotify(AsyncRequest *areq);
+extern void ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index cafd410a5d..8c7ebc2998 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -25,4 +25,7 @@ extern void ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt);
+extern void ExecReconsiderAsyncAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(AsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 6ae7733e25..56c3809d2d 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -17,6 +17,8 @@
#include "access/parallel.h"
#include "nodes/execnodes.h"
+struct AsyncContext;
+
extern ForeignScanState *ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags);
extern void ExecEndForeignScan(ForeignScanState *node);
extern void ExecReScanForeignScan(ForeignScanState *node);
@@ -31,4 +33,10 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecReconsiderAsyncForeignScan(ForeignScanState *node,
+ struct AsyncContext *acxt);
+extern void ExecAsyncForeignScanRequest(AsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncForeignScanNotify(AsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 248f78da45..99cabd6b94 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -19,6 +19,7 @@
/* To avoid including explain.h here, reference ExplainState thus: */
struct ExplainState;
+struct AsyncContext;
/*
* Callback function signatures --- see fdwhandler.sgml for more info.
@@ -178,6 +179,17 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+
+typedef bool (*ReconsiderAsyncForeignScan_function) (ForeignScanState *node,
+ struct AsyncContext *acxt);
+
+typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncConfigureWait_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncNotify_function) (AsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -256,6 +268,13 @@ typedef struct FdwRoutine
/* Support functions for path reparameterization. */
ReparameterizeForeignPathByChild_function ReparameterizeForeignPathByChild;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ReconsiderAsyncForeignScan_function ReconsiderAsyncForeignScan;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b6a88ff76b..68584b3c14 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -512,6 +512,32 @@ typedef struct ResultRelInfo
struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
+/* ----------------
+ * AsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct AsyncRequest
+{
+ struct PlanState *requestor; /* Node that wants a tuple */
+ struct PlanState *requestee; /* Node from which a tuple is wanted */
+ int request_index; /* Scratch space for requestor */
+ bool callback_pending; /* Callback is needed */
+ bool request_complete; /* Request complete, result valid */
+ TupleTableSlot *result; /* Result (NULL if no more tuples) */
+} AsyncRequest;
+
+/*
+ * Hash entry to store the set of IDs of ForeignScanStates that use the same
+ * user mapping
+ */
+typedef struct ForeignScanHashEntry
+{
+ Oid umid; /* hash key -- must be first */
+ Bitmapset *fsplanids;
+} ForeignScanHashEntry;
+
/* ----------------
* EState information
*
@@ -602,6 +628,14 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+ List *es_asyncappends; /* List of async-aware AppendStates */
+
+ /*
+ * Hash table to store the set of IDs of ForeignScanStates using the same
+ * user mapping
+ */
+ HTAB *es_foreign_scan_hash;
+
/*
* JIT information. es_jit_flags indicates whether JIT should be performed
* and with which options. es_jit is created on-demand when JITing is
@@ -969,6 +1003,8 @@ typedef struct PlanState
*/
Bitmapset *chgParam; /* set of IDs of changed Params */
+ bool async_capable;
+
/*
* Other run-time state needed by most if not all node types.
*/
@@ -1217,12 +1253,24 @@ struct AppendState
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
int as_whichplan;
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_asyncplans; /* asynchronous plans indexes */
+ Bitmapset *as_asyncplanids; /* asynchronous plans IDs */
+ int as_nasyncplans; /* # of asynchronous plans */
+ AsyncRequest **as_asyncrequests; /* array of AsyncRequests */
+ TupleTableSlot **as_asyncresults; /* unreturned results of async plans */
+ int as_nasyncresults; /* # of valid entries in as_asyncresults */
+ int as_nasyncremain; /* # of remaining async plans */
+ Bitmapset *as_needrequest; /* async plans ready for a request */
+ struct WaitEventSet *as_eventset; /* WaitEventSet used to configure
+ * file descriptor wait events */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_asyncplans;
bool (*choose_next_subplan) (AppendState *);
};
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ec93e648c..e76db3eb4c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -141,6 +141,8 @@ typedef struct PlannerGlobal
char maxParallelHazard; /* worst PROPARALLEL hazard level */
PartitionDirectory partition_directory; /* partition descriptors */
+
+ bool asyncPlan; /* does plan have async-aware Append? */
} PlannerGlobal;
/* macro for fetching the Plan associated with a SubPlan node */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 43160439f0..c636b498ef 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -59,6 +59,8 @@ typedef struct PlannedStmt
bool parallelModeNeeded; /* parallel mode required to execute? */
+ bool asyncPlan; /* does plan have async-aware Append? */
+
int jitFlags; /* which forms of JIT should be performed */
struct Plan *planTree; /* tree of Plan nodes */
@@ -129,6 +131,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -245,6 +252,7 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
+ int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ed2e4af4be..c2952e375d 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -65,6 +65,7 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
+extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..d9588da38a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -957,6 +957,7 @@ typedef enum
*/
typedef enum
{
+ WAIT_EVENT_APPEND_READY,
WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE = PG_WAIT_IPC,
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index dc7ab2ce8b..e78ca7bddb 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -87,6 +87,7 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -136,6 +137,7 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Plan> +
<Node-Type>Seq Scan</Node-Type> +
<Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
<Relation-Name>int8_tbl</Relation-Name> +
<Alias>i8</Alias> +
<Startup-Cost>N.N</Startup-Cost> +
@@ -183,6 +185,7 @@ select explain_filter('explain (analyze, buffers, format yaml) select * from int
- Plan: +
Node Type: "Seq Scan" +
Parallel Aware: false +
+ Async Capable: false +
Relation Name: "int8_tbl"+
Alias: "i8" +
Startup Cost: N.N +
@@ -233,6 +236,7 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -348,6 +352,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Relation Name": "tenk1", +
"Parallel Aware": true, +
"Local Hit Blocks": 0, +
@@ -393,6 +398,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Sort Space Used": 0, +
"Local Hit Blocks": 0, +
@@ -435,6 +441,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Workers Planned": 0, +
"Local Hit Blocks": 0, +
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index d574583844..406fb88130 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -558,6 +558,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 55, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
@@ -745,6 +746,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 70, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index ff157ceb1c..499245068a 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -204,6 +204,7 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "insertconflicttest", +
"Alias": "insertconflicttest", +
"Conflict Resolution": "UPDATE", +
@@ -213,7 +214,8 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
{ +
"Node Type": "Result", +
"Parent Relationship": "Member", +
- "Parallel Aware": false +
+ "Parallel Aware": false, +
+ "Async Capable": false +
} +
] +
} +
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 81bdacf59d..b7818c0637 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -88,6 +88,7 @@ select count(*) = 1 as ok from pg_stat_wal;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
@@ -106,7 +107,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
At Wed, 10 Feb 2021 21:31:15 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Wed, Feb 10, 2021 at 7:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Attached is an updated version of the patch. Sorry for the delay.
I noticed that I forgot to add new files. :-(. Please find attached
an updated patch.
Thanks for the new version.
It seems too specific to async Append, so I regard it as a PoC of the
mechanism.
It creates a hash table keyed by connection umid to record the
planids run on the connection, triggered by the core planner via a dedicated
API function. It seems to me that ConnCacheEntry.state can hold that
and the hash is not needed at all.
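To illustrate the idea of keeping the plan-ID set in per-connection state rather than a separate hash table, here is a minimal sketch. It uses a plain uint64_t bitmask in place of PostgreSQL's Bitmapset, and the names (ConnState, conn_state_add_plan, conn_used_outside) are hypothetical, not postgres_fdw's actual API:

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Hypothetical per-connection state, standing in for a field kept in
 * ConnCacheEntry.state.  A uint64_t bitmask replaces Bitmapset, so this
 * toy version only supports plan IDs 0..63.
 */
typedef struct ConnState
{
    uint64_t fsplanids;   /* IDs of ForeignScan plans using this connection */
} ConnState;

/* Record that plan "planid" runs on this connection. */
static void
conn_state_add_plan(ConnState *state, int planid)
{
    state->fsplanids |= UINT64_C(1) << planid;
}

/* Does this connection serve any plan outside the given set? */
static bool
conn_used_outside(const ConnState *state, uint64_t asyncplanids)
{
    return (state->fsplanids & ~asyncplanids) != 0;
}
```

With this shape, the check that currently consults the hash table could instead read the set directly off the connection cache entry.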
| postgresReconsiderAsyncForeignScan(ForeignScanState *node, AsyncContext *acxt)
| {
| ...
| /*
| * If the connection used for the ForeignScan node is used in other parts
| * of the query plan tree except async subplans of the parent Append node,
| * disable async execution of the ForeignScan node.
| */
| if (!bms_is_subset(fsplanids, asyncplanids))
| return false;
This would be a reasonable restriction.
| /*
| * If the subplans of the Append node are all async-capable, and use the
| * same connection, then we won't execute them asynchronously.
| */
| if (requestor->as_nasyncplans == requestor->as_nplans &&
| !bms_nonempty_difference(asyncplanids, fsplanids))
| return false;
Is this the correct restriction? I understand that the currently
intended restriction is that one connection accepts at most one FDW-scan
node. This looks like something different...
(Sorry, time's up for now.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Feb 12, 2021 at 5:30 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
It seems too specific to async Append, so I regard it as a PoC of the
mechanism.
Are you saying that the patch only reconsiders async ForeignScans?
It creates a hash table keyed by connection umid to record the
planids run on the connection, triggered by the core planner via a dedicated
API function. It seems to me that ConnCacheEntry.state can hold that
and the hash is not needed at all.
I think a good thing about the hash table is that it can be used by
other FDWs that support async execution in a similar way to
postgres_fdw, so they don’t need to create their own hash tables. But
I’d like to know about the idea of using ConnCacheEntry. Could you
elaborate a bit more about that?
| postgresReconsiderAsyncForeignScan(ForeignScanState *node, AsyncContext *acxt)
| {
| ...
| /*
| * If the connection used for the ForeignScan node is used in other parts
| * of the query plan tree except async subplans of the parent Append node,
| * disable async execution of the ForeignScan node.
| */
| if (!bms_is_subset(fsplanids, asyncplanids))
| return false;
This would be a reasonable restriction.
Cool!
| /*
| * If the subplans of the Append node are all async-capable, and use the
| * same connection, then we won't execute them asynchronously.
| */
| if (requestor->as_nasyncplans == requestor->as_nplans &&
| !bms_nonempty_difference(asyncplanids, fsplanids))
| return false;
Is this the correct restriction? I understand that the currently
intended restriction is that one connection accepts at most one FDW-scan
node. This looks like something different...
In sharding setups, people put multiple partitions on a remote
PostgreSQL server, so the patch allows multiple postgres_fdw ForeignScans
beneath an Append that use the same connection to be executed
asynchronously like this:
postgres=# create table t1 (a int, b int, c text);
postgres=# create table t2 (a int, b int, c text);
postgres=# create table t3 (a int, b int, c text);
postgres=# create foreign table p1 (a int, b int, c text) server
server1 options (table_name 't1');
postgres=# create foreign table p2 (a int, b int, c text) server
server2 options (table_name 't2');
postgres=# create foreign table p3 (a int, b int, c text) server
server2 options (table_name 't3');
postgres=# create table pt (a int, b int, c text) partition by range (a);
postgres=# alter table pt attach partition p1 for values from (10) to (20);
postgres=# alter table pt attach partition p2 for values from (20) to (30);
postgres=# alter table pt attach partition p3 for values from (30) to (40);
postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM0000')
from generate_series(0, 99) i;
postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM0000')
from generate_series(0, 99) i;
postgres=# insert into p3 select 30 + i % 10, i, to_char(i, 'FM0000')
from generate_series(0, 99) i;
postgres=# analyze pt;
postgres=# explain verbose select count(*) from pt;
QUERY PLAN
------------------------------------------------------------------------------------------
Aggregate (cost=314.25..314.26 rows=1 width=8)
Output: count(*)
-> Append (cost=100.00..313.50 rows=300 width=0)
-> Async Foreign Scan on public.p1 pt_1
(cost=100.00..104.00 rows=100 width=0)
Remote SQL: SELECT NULL FROM public.t1
-> Async Foreign Scan on public.p2 pt_2
(cost=100.00..104.00 rows=100 width=0)
Remote SQL: SELECT NULL FROM public.t2
-> Async Foreign Scan on public.p3 pt_3
(cost=100.00..104.00 rows=100 width=0)
Remote SQL: SELECT NULL FROM public.t3
(9 rows)
For this query, p2 and p3, which use the same connection, are scanned
asynchronously!
But if all the subplans of an Append are async postgres_fdw
ForeignScans that use the same connection, they won’t be parallelized
at all, and the overhead of async execution may cause a performance
degradation. So the patch disables async execution of them in that
case using the above code bit.
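The two checks quoted above can be sketched as follows; this is an illustrative stand-alone version using uint64_t bitmasks as toy stand-ins for bms_is_subset()/bms_nonempty_difference(), and allow_async is a hypothetical name, not postgres_fdw's actual function:

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy stand-ins for the Bitmapset primitives, on uint64_t masks. */
static bool is_subset(uint64_t a, uint64_t b)            { return (a & ~b) == 0; }
static bool nonempty_difference(uint64_t a, uint64_t b)  { return (a & ~b) != 0; }

/*
 * Decide whether a ForeignScan may run asynchronously, mirroring the two
 * checks in postgresReconsiderAsyncForeignScan() quoted above.
 */
static bool
allow_async(uint64_t fsplanids,    /* plans sharing this connection */
            uint64_t asyncplanids, /* async subplans of the parent Append */
            int nasyncplans, int nplans)
{
    /* Connection also used outside the Append's async subplans: disable. */
    if (!is_subset(fsplanids, asyncplanids))
        return false;

    /*
     * All subplans are async and all use this one connection: nothing can
     * overlap, so the async overhead would only cost us.  Disable.
     */
    if (nasyncplans == nplans &&
        !nonempty_difference(asyncplanids, fsplanids))
        return false;

    return true;
}
```

In the p1/p2/p3 example above, p2 and p3 share server2's connection while p1 uses server1, so the second check does not fire and all three scans stay async.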
Thanks for the review!
Best regards,
Etsuro Fujita
On Wed, Feb 10, 2021 at 9:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Please find attached an updated patch.
I noticed that this doesn’t work for cases where ForeignScans are
executed inside functions, and I don’t have any simple solution for
that. So I’m getting back to what Horiguchi-san proposed for
postgres_fdw to handle concurrent fetches from a remote server
performed by multiple ForeignScan nodes that use the same connection.
As discussed before, we would need to create a scheduler for
performing such fetches in a more optimized way to avoid a performance
degradation in some cases, but that wouldn’t be easy. Instead, how
about reducing concurrency as an alternative? In his proposal,
postgres_fdw was modified to perform prefetching pretty aggressively,
so I mean removing aggressive prefetching. I think we could add it to
postgres_fdw later maybe as the server/table options. Sorry for the
back and forth.
Best regards,
Etsuro Fujita
Sorry that I haven't been able to respond.
At Thu, 18 Feb 2021 11:51:59 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Wed, Feb 10, 2021 at 9:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Please find attached an updated patch.
I noticed that this doesn’t work for cases where ForeignScans are
executed inside functions, and I don’t have any simple solution for
Ah, concurrent fetches in different plan trees? (To be fair, I
hadn't noticed that case :p) The same can happen with an extension that
is called via hooks.
that. So I’m getting back to what Horiguchi-san proposed for
postgres_fdw to handle concurrent fetches from a remote server
performed by multiple ForeignScan nodes that use the same connection.
As discussed before, we would need to create a scheduler for
performing such fetches in a more optimized way to avoid a performance
degradation in some cases, but that wouldn’t be easy. Instead, how
If the "degradation" means degradation caused by repeated creation of
remote cursors, every node on the same connection creates its
own cursor named "c<n>", and it is never recreated in any case.
If the "degradation" means that my patch needs to wait for the
previous prefetching query to return tuples before sending a new query
(vacate_connection()), it is just moving the wait from just before
sending the new query to just before fetching the next round of the
previous node. The only case where the degradation becomes visible is when
the tuples in the next round are not wanted by the upper nodes.
unpatched
nodeA <tuples exhausted>
<send prefetching FETCH A>
<return the last tuple of the last round>
nodeB !!<wait for FETCH A returns>
<send FETCH B>
!!<wait for FETCH B returns>
<return tuple just returned>
nodeA <return already fetched tuple>
patched
nodeA <tuples exhausted>
<return the last tuple of the last round>
nodeB <send FETCH B>
!!<wait for FETCH B returns>
<return the first tuple of the round>
nodeA <send FETCH A>
!!<wait for FETCH A returns>
<return the first tuple of the round>
That happens when the upper node stops just after the internal
tuplestore is emptied, and the probability is one in fetch_tuples. (It
is not stochastic, so if a query suffers from the degradation, it
always suffers unless fetch_tuples is changed.) I'm still not
sure that that degree of degradation becomes a show-stopper.
degradation in some cases, but that wouldn’t be easy. Instead, how
about reducing concurrency as an alternative? In his proposal,
postgres_fdw was modified to perform prefetching pretty aggressively,
so I mean removing aggressive prefetching. I think we could add it to
postgres_fdw later maybe as the server/table options. Sorry for the
back and forth.
That was the natural extension from non-aggressive prefetching.
However, maybe we can live without that, since if someone needs more
speed, it is enough to give every remote table a dedicated
connection.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Feb 18, 2021 at 3:16 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Thu, 18 Feb 2021 11:51:59 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
I noticed that this doesn’t work for cases where ForeignScans are
executed inside functions, and I don’t have any simple solution for
Ah, concurrent fetches in different plan trees? (To be fair, I
hadn't noticed that case :p) The same can happen with an extension that
is called via hooks.
Yeah, consider a plan containing a FunctionScan that invokes a query
like e.g., “SELECT * FROM foreign_table” via SPI.
So I’m getting back to what Horiguchi-san proposed for
postgres_fdw to handle concurrent fetches from a remote server
performed by multiple ForeignScan nodes that use the same connection.
As discussed before, we would need to create a scheduler for
performing such fetches in a more optimized way to avoid a performance
degradation in some cases, but that wouldn’t be easy.
If the "degradation" means degradation caused by repeated creation of
remote cursors, every node on the same connection creates its
own cursor named "c<n>", and it is never recreated in any case.
If the "degradation" means that my patch needs to wait for the
previous prefetching query to return tuples before sending a new query
(vacate_connection()), it is just moving the wait from just before
sending the new query to just before fetching the next round of the
previous node. The only case where the degradation becomes visible is when
the tuples in the next round are not wanted by the upper nodes.
The latter. And yeah, typical cases where the performance degradation
occurs would be queries with LIMIT, as discussed in [1]/messages/by-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx+3dyivCT=A@mail.gmail.com.
I’m not concerned about postgres_fdw modified to process an
in-progress fetch by a ForeignScan before starting a new
asynchronous/synchronous fetch by another ForeignScan using the same
connection. Actually, that seems pretty reasonable to me, so I’d like
to use that part in your patch in the next version. My concern is
that postgresIterateForeignScan() was modified to start another
asynchronous fetch from a remote table (if possible) right after doing
fetch_received_data() for the remote table, because aggressive
prefetching like that may increase the probability that ForeignScans
using the same connection conflict with each other, leading to a large
performance degradation. (Another issue with that would be that the
fsstate->tuples array for the remote table may be enlarged
indefinitely.)
Whether the degradation is acceptable or not would depend on the user,
and needless to say, the smaller degradation would be more acceptable.
So I’ll update the patch using your patch without the
postgresIterateForeignScan() change.
In his proposal,
postgres_fdw was modified to perform prefetching pretty aggressively,
so I mean removing aggressive prefetching. I think we could add it to
postgres_fdw later maybe as the server/table options.
That was the natural extension from non-aggressive prefetching.
I also suppose that that would improve the performance in some cases.
Let’s leave that for future work.
However, maybe we can live without that, since if someone needs more
speed, it is enough to give every remote table a dedicated
connection.
Yeah, I think so too.
Thanks!
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx+3dyivCT=A@mail.gmail.com
On Sat, Feb 20, 2021 at 3:35 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
So I’ll update the patch using your patch without the
postgresIterateForeignScan() change.
Here is an updated version of the patch. Based on your idea of
completing an in-progress command (if any) before sending a new
command to the remote, I created a function for that,
process_pending_request(), and added it where needed in
contrib/postgres_fdw. I also adjusted the patch, and fixed some bugs
in the postgres_fdw part of the patch.
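The complete-pending-before-send pattern that the patch applies in GetConnection() and pgfdw_exec_query() can be sketched like this; the Conn/AsyncReq types and function names here are illustrative only, not the patch's actual definitions:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins: a connection with at most one in-flight
 * asynchronous request, which must be drained before any new command
 * is submitted on the same connection.
 */
typedef struct AsyncReq
{
    bool complete;       /* has the in-flight request finished? */
} AsyncReq;

typedef struct Conn
{
    AsyncReq *pending;   /* in-flight async request, or NULL */
    int       sent;      /* number of commands issued (for illustration) */
} Conn;

/* Mirror of process_pending_request(): finish the in-flight request. */
static void
process_pending(Conn *conn)
{
    if (conn->pending)
    {
        conn->pending->complete = true;   /* drain the pending request */
        conn->pending = NULL;
    }
}

/* Any new command first completes the pending request, then submits. */
static void
exec_command(Conn *conn)
{
    process_pending(conn);
    conn->sent++;
}
```

This keeps one connection usable by multiple ForeignScans without interleaving two in-flight commands on it.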
Best regards,
Etsuro Fujita
Attachments:
async-wip-2021-03-01.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ee0b4acf0b..a2c8eb93a1 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -62,6 +62,7 @@ typedef struct ConnCacheEntry
Oid serverid; /* foreign server OID used to get server name */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ PgFdwConnState state; /* extra per-connection state */
} ConnCacheEntry;
/*
@@ -117,7 +118,7 @@ static bool disconnect_cached_connections(Oid serverid);
* (not even on error), we need this flag to cue manual cleanup.
*/
PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
{
bool found;
bool retry = false;
@@ -196,6 +197,9 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
*/
PG_TRY();
{
+ /* Process a pending asynchronous request if any. */
+ if (entry->state.pendingAreq)
+ process_pending_request(entry->state.pendingAreq);
/* Start a new transaction or subtransaction if needed. */
begin_remote_xact(entry);
}
@@ -264,6 +268,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Remember if caller will prepare statements */
entry->have_prep_stmt |= will_prep_stmt;
+ /* If caller needs access to the per-connection state, return it. */
+ if (state)
+ *state = &entry->state;
+
return entry->conn;
}
@@ -291,6 +299,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ memset(&entry->state, 0, sizeof(entry->state));
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
@@ -648,8 +657,12 @@ GetPrepStmtNumber(PGconn *conn)
* Caller is responsible for the error handling on the result.
*/
PGresult *
-pgfdw_exec_query(PGconn *conn, const char *query)
+pgfdw_exec_query(PGconn *conn, const char *query, PgFdwConnState *state)
{
+ /* First, process a pending asynchronous request, if any. */
+ if (state && state->pendingAreq)
+ process_pending_request(state->pendingAreq);
+
/*
* Submit a query. Since we don't use non-blocking mode, this also can
* block. But its risk is relatively small, so we ignore that for now.
@@ -940,6 +953,8 @@ pgfdw_xact_callback(XactEvent event, void *arg)
{
entry->have_prep_stmt = false;
entry->have_error = false;
+ /* Also reset per-connection state */
+ memset(&entry->state, 0, sizeof(entry->state));
}
/* Disarm changing_xact_state if it all worked. */
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0649b6b81c..f3432ab790 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -7021,7 +7021,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7049,7 +7049,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7077,7 +7077,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7105,7 +7105,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7133,7 +7133,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+----
(0 rows)
@@ -7175,35 +7175,40 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7213,35 +7218,40 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7273,7 +7283,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-> Hash Join
@@ -7291,7 +7301,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
(39 rows)
@@ -7326,12 +7336,12 @@ where bar.f1 = ss.f1;
-> Append
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
+ -> Async Foreign Scan on public.foo2 foo_1
Output: ROW(foo_1.f1), foo_1.f1
Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
+ -> Async Foreign Scan on public.foo2 foo_3
Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
-> Hash
@@ -7353,12 +7363,12 @@ where bar.f1 = ss.f1;
-> Append
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
+ -> Async Foreign Scan on public.foo2 foo_1
Output: ROW(foo_1.f1), foo_1.f1
Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
+ -> Async Foreign Scan on public.foo2 foo_3
Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
(45 rows)
@@ -7511,27 +7521,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2 bar_1
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2 bar_1
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with t as (update bar set f2 = f2 + 100 returning *) select * from t order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: t.f1, t.f2
+ Sort Key: t.f1
+ CTE t
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2 bar_1
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2 bar_1
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on t
+ Output: t.f1, t.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with t as (update bar set f2 = f2 + 100 returning *) select * from t order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
@@ -8606,9 +8622,9 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
(7 rows)
@@ -8645,19 +8661,19 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
(11 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
@@ -8687,9 +8703,9 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
(7 rows)
@@ -8744,20 +8760,20 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
-- test FOR UPDATE; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
- QUERY PLAN
---------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------
LockRows
-> Sort
Sort Key: t1.a
-> Hash Join
Hash Cond: (t2.b = t1.a)
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
-> Hash
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
(12 rows)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
@@ -8793,17 +8809,17 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
(9 rows)
-- Plan with partitionwise aggregates is enabled
@@ -8815,11 +8831,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
(9 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 35b48575c5..75c8026ed4 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -37,6 +38,7 @@
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
#include "postgres_fdw.h"
+#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/float.h"
#include "utils/guc.h"
@@ -143,6 +145,7 @@ typedef struct PgFdwScanState
/* for remote query execution */
PGconn *conn; /* connection for the scan */
+ PgFdwConnState *conn_state; /* extra per-connection state */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -159,6 +162,9 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ /* for asynchronous execution */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -176,6 +182,7 @@ typedef struct PgFdwModifyState
/* for remote query execution */
PGconn *conn; /* connection for the scan */
+ PgFdwConnState *conn_state; /* extra per-connection state */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -219,6 +226,7 @@ typedef struct PgFdwDirectModifyState
/* for remote query execution */
PGconn *conn; /* connection for the update */
+ PgFdwConnState *conn_state; /* extra per-connection state */
int numParams; /* number of parameters passed to query */
FmgrInfo *param_flinfo; /* output conversion functions for them */
List *param_exprs; /* executable expressions for param values */
@@ -408,6 +416,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(AsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(AsyncRequest *areq);
+static void postgresForeignAsyncNotify(AsyncRequest *areq);
/*
* Helper functions
@@ -437,7 +449,8 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
void *arg);
static void create_cursor(ForeignScanState *node);
static void fetch_more_data(ForeignScanState *node);
-static void close_cursor(PGconn *conn, unsigned int cursor_number);
+static void close_cursor(PGconn *conn, unsigned int cursor_number,
+ PgFdwConnState *conn_state);
static PgFdwModifyState *create_foreign_modify(EState *estate,
RangeTblEntry *rte,
ResultRelInfo *resultRelInfo,
@@ -491,6 +504,8 @@ static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
double *totaldeadrows);
static void analyze_row_processor(PGresult *res, int row,
PgFdwAnalyzeState *astate);
+static void request_tuple_asynchronously(AsyncRequest *areq, bool fetch);
+static void fetch_more_data_begin(AsyncRequest *areq);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -583,6 +598,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for asynchronous execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1458,7 +1479,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
/* Assign a unique ID for my cursor */
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1509,6 +1530,9 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
+
+ /* Initialize async state */
+ fsstate->async_capable = node->ss.ps.plan->async_capable;
}
/*
@@ -1523,8 +1547,10 @@ postgresIterateForeignScan(ForeignScanState *node)
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
+ * In sync mode, if this is the first call after Begin or ReScan, we need
+ * to create the cursor on the remote side. In async mode, we would have
+ * already created the cursor before we get here, even if this is the first
+ * call after Begin or ReScan.
*/
if (!fsstate->cursor_exists)
create_cursor(node);
@@ -1534,6 +1560,9 @@ postgresIterateForeignScan(ForeignScanState *node)
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
+ /* In async mode, just clear tuple slot. */
+ if (fsstate->async_capable)
+ return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
fetch_more_data(node);
@@ -1595,7 +1624,7 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->conn, sql, fsstate->conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
PQclear(res);
@@ -1623,7 +1652,8 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->conn, fsstate->cursor_number,
+ fsstate->conn_state);
/* Release remote connection */
ReleaseConnection(fsstate->conn);
@@ -2500,7 +2530,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->conn = GetConnection(user, false, &dmstate->conn_state);
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2881,7 +2911,7 @@ estimate_path_cost_size(PlannerInfo *root,
false, &retrieved_attrs, NULL);
/* Get the remote estimate */
- conn = GetConnection(fpinfo->user, false);
+ conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3327,7 +3357,7 @@ get_remote_estimate(const char *sql, PGconn *conn,
/*
* Execute EXPLAIN remotely.
*/
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_exec_query(conn, sql, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql);
@@ -3451,6 +3481,10 @@ create_cursor(ForeignScanState *node)
StringInfoData buf;
PGresult *res;
+ /* First, process a pending asynchronous request, if any. */
+ if (fsstate->conn_state->pendingAreq)
+ process_pending_request(fsstate->conn_state->pendingAreq);
+
/*
* Construct array of query parameter values in text format. We do the
* conversions in the short-lived per-tuple context, so as not to cause a
@@ -3531,17 +3565,38 @@ fetch_more_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->conn;
- char sql[64];
int numrows;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
+ if (fsstate->async_capable)
+ {
+ Assert(fsstate->conn_state->pendingAreq);
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
- pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ /*
+ * The query was already sent by an earlier call to
+ * fetch_more_data_begin. So now we just fetch the result.
+ */
+ res = pgfdw_get_result(conn, fsstate->query);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+
+ /* Reset per-connection state */
+ fsstate->conn_state->pendingAreq = NULL;
+ }
+ else
+ {
+ char sql[64];
+
+ /* This is a regular synchronous fetch. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ res = pgfdw_exec_query(conn, sql, fsstate->conn_state);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
/* Convert the data into HeapTuples */
numrows = PQntuples(res);
@@ -3633,7 +3688,8 @@ reset_transmission_modes(int nestlevel)
* Utility routine to close a cursor.
*/
static void
-close_cursor(PGconn *conn, unsigned int cursor_number)
+close_cursor(PGconn *conn, unsigned int cursor_number,
+ PgFdwConnState *conn_state)
{
char sql[64];
PGresult *res;
@@ -3644,7 +3700,7 @@ close_cursor(PGconn *conn, unsigned int cursor_number)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_exec_query(conn, sql, conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, true, sql);
PQclear(res);
@@ -3693,7 +3749,7 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->conn = GetConnection(user, true, &fmstate->conn_state);
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -3792,6 +3848,10 @@ execute_foreign_modify(EState *estate,
operation == CMD_UPDATE ||
operation == CMD_DELETE);
+ /* First, process a pending asynchronous request, if any. */
+ if (fmstate->conn_state->pendingAreq)
+ process_pending_request(fmstate->conn_state->pendingAreq);
+
/*
* If the existing query was deparsed and prepared for a different number
* of rows, rebuild it for the proper number.
@@ -3893,6 +3953,11 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
char *p_name;
PGresult *res;
+ /*
+ * The caller would already have processed a pending asynchronous request
+ * if any, so no need to do it here.
+ */
+
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
GetPrepStmtNumber(fmstate->conn));
@@ -4078,7 +4143,7 @@ deallocate_query(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->conn, sql, fmstate->conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
PQclear(res);
@@ -4226,6 +4291,10 @@ execute_dml_stmt(ForeignScanState *node)
int numParams = dmstate->numParams;
const char **values = dmstate->param_values;
+ /* First, process a pending asynchronous request, if any. */
+ if (dmstate->conn_state->pendingAreq)
+ process_pending_request(dmstate->conn_state->pendingAreq);
+
/*
* Construct array of query parameter values in text format.
*/
@@ -4627,7 +4696,7 @@ postgresAnalyzeForeignTable(Relation relation,
*/
table = GetForeignTable(RelationGetRelid(relation));
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct command to get page count for relation.
@@ -4638,7 +4707,7 @@ postgresAnalyzeForeignTable(Relation relation,
/* In what follows, do not risk leaking any PGresults. */
PG_TRY();
{
- res = pgfdw_exec_query(conn, sql.data);
+ res = pgfdw_exec_query(conn, sql.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -4713,7 +4782,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
table = GetForeignTable(RelationGetRelid(relation));
server = GetForeignServer(table->serverid);
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct cursor that retrieves whole rows from remote.
@@ -4730,7 +4799,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
int fetch_size;
ListCell *lc;
- res = pgfdw_exec_query(conn, sql.data);
+ res = pgfdw_exec_query(conn, sql.data, NULL);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
PQclear(res);
@@ -4782,7 +4851,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
*/
/* Fetch some rows */
- res = pgfdw_exec_query(conn, fetch_sql);
+ res = pgfdw_exec_query(conn, fetch_sql, NULL);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -4801,7 +4870,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
}
/* Close the cursor, just to be tidy. */
- close_cursor(conn, cursor_number);
+ close_cursor(conn, cursor_number, NULL);
}
PG_CATCH();
{
@@ -4941,7 +5010,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
*/
server = GetForeignServer(serverOid);
mapping = GetUserMapping(GetUserId(), server->serverid);
- conn = GetConnection(mapping, false);
+ conn = GetConnection(mapping, false, NULL);
/* Don't attempt to import collation if remote server hasn't got it */
if (PQserverVersion(conn) < 90100)
@@ -4957,7 +5026,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, "SELECT 1 FROM pg_catalog.pg_namespace WHERE nspname = ");
deparseStringLiteral(&buf, stmt->remote_schema);
- res = pgfdw_exec_query(conn, buf.data);
+ res = pgfdw_exec_query(conn, buf.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -5069,7 +5138,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, " ORDER BY c.relname, a.attnum");
/* Fetch the data */
- res = pgfdw_exec_query(conn, buf.data);
+ res = pgfdw_exec_query(conn, buf.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -6488,6 +6557,211 @@ add_foreign_final_paths(PlannerInfo *root, RelOptInfo *input_rel,
add_path(final_rel, (Path *) final_path);
}
+/*
+ * postgresIsForeignPathAsyncCapable
+ * Check whether a given ForeignPath node is async-capable.
+ */
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * postgresForeignAsyncRequest
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncRequest(AsyncRequest *areq)
+{
+ request_tuple_asynchronously(areq, true);
+}
+
+/*
+ * postgresForeignAsyncConfigureWait
+ * Configure a file descriptor event for which we wish to wait.
+ */
+static void
+postgresForeignAsyncConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AsyncRequest *pendingAreq = fsstate->conn_state->pendingAreq;
+ AppendState *requestor = (AppendState *) areq->requestor;
+ WaitEventSet *set = requestor->as_eventset;
+
+ /* This should not be called unless callback_pending */
+ Assert(areq->callback_pending);
+
+ /* The core code would have registered postmaster death event */
+ Assert(GetNumRegisteredWaitEvents(set) >= 1);
+
+ /* Begin an asynchronous data fetch if necessary */
+ if (!pendingAreq)
+ fetch_more_data_begin(areq);
+ else if (pendingAreq->requestor != areq->requestor)
+ {
+ if (GetNumRegisteredWaitEvents(set) > 1)
+ return;
+ process_pending_request(pendingAreq);
+ fetch_more_data_begin(areq);
+ }
+ else if (pendingAreq->requestee != areq->requestee)
+ return;
+ else
+ Assert(pendingAreq == areq);
+
+ AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
+ NULL, areq);
+}
+
+/*
+ * postgresForeignAsyncNotify
+ * Fetch some more tuples from a file descriptor that became ready,
+ * requesting the next tuple.
+ */
+static void
+postgresForeignAsyncNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+
+ /* The core code would have initialized the callback_pending flag */
+ Assert(!areq->callback_pending);
+
+ /* The request should be currently in-process */
+ Assert(fsstate->conn_state->pendingAreq == areq);
+
+ /* On error, report the original query, not the FETCH. */
+ if (!PQconsumeInput(fsstate->conn))
+ pgfdw_report_error(ERROR, NULL, fsstate->conn, false, fsstate->query);
+
+ fetch_more_data(node);
+
+ request_tuple_asynchronously(areq, true);
+}
+
+/*
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+request_tuple_asynchronously(AsyncRequest *areq, bool fetch)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AsyncRequest *pendingAreq = fsstate->conn_state->pendingAreq;
+ TupleTableSlot *result;
+
+ /* This should not be called if the request is currently in-process */
+ Assert(areq != pendingAreq);
+
+ /* Request some more tuples, if we've run out */
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ /* No point in another fetch if we already detected EOF, though */
+ if (!fsstate->eof_reached)
+ {
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ /* Begin another fetch if requested and if no pending request */
+ if (fetch && !pendingAreq)
+ fetch_more_data_begin(areq);
+ }
+ else
+ {
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ }
+ return;
+ }
+
+ /* Get a tuple from the ForeignScan node */
+ result = ExecProcNode((PlanState *) node);
+ if (!TupIsNull(result))
+ {
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ return;
+ }
+ Assert(fsstate->next_tuple >= fsstate->num_tuples);
+
+ /* Request some more tuples, if we've not detected EOF yet */
+ if (!fsstate->eof_reached)
+ {
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ /* Begin another fetch if requested and if no pending request */
+ if (fetch && !pendingAreq)
+ fetch_more_data_begin(areq);
+ }
+ else
+ {
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ }
+}
+
+/*
+ * Begin an asynchronous data fetch.
+ *
+ * Note: fetch_more_data must be called to fetch the result.
+ */
+static void
+fetch_more_data_begin(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ char sql[64];
+
+ Assert(!fsstate->conn_state->pendingAreq);
+
+ /* Create the cursor synchronously. */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ /* We will send this query, but not wait for the response. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(fsstate->conn, sql))
+ pgfdw_report_error(ERROR, NULL, fsstate->conn, false, fsstate->query);
+
+ /* Remember that the request is in process */
+ fsstate->conn_state->pendingAreq = areq;
+}
+
+/*
+ * Process a pending asynchronous request.
+ */
+void
+process_pending_request(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ EState *estate = node->ss.ps.state;
+ MemoryContext oldcontext;
+
+ /* The request should be currently in-process */
+ Assert(fsstate->conn_state->pendingAreq == areq);
+
+ oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+
+ fetch_more_data(node);
+
+ request_tuple_asynchronously(areq, false);
+
+ /* Unlike ForeignAsyncNotify(), we call ExecAsyncResponse() ourselves */
+ ExecAsyncResponse(areq);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 1f67b4d9fd..3b7442f335 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,6 +16,7 @@
#include "foreign/foreign.h"
#include "lib/stringinfo.h"
#include "libpq-fe.h"
+#include "nodes/execnodes.h"
#include "nodes/pathnodes.h"
#include "utils/relcache.h"
@@ -124,17 +125,28 @@ typedef struct PgFdwRelationInfo
int relation_index;
} PgFdwRelationInfo;
+/*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+ AsyncRequest *pendingAreq; /* pending async request */
+} PgFdwConnState;
+
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
+extern void process_pending_request(AsyncRequest *areq);
/* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+ PgFdwConnState **state);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
extern PGresult *pgfdw_get_result(PGconn *conn, const char *query);
-extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query);
+extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query,
+ PgFdwConnState *state);
extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
bool clear, const char *sql);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 2b525ea44a..caab9b37ed 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1822,31 +1822,31 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1882,12 +1882,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1946,8 +1946,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with t as (update bar set f2 = f2 + 100 returning *) select * from t order by 1;
+with t as (update bar set f2 = f2 + 100 returning *) select * from t order by 1;
-- Test that UPDATE/DELETE with inherited target works with row-level triggers
CREATE TRIGGER trig_row_before
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b5718fc136..616384c14c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4770,6 +4770,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<variablelist>
+ <varlistentry id="guc-enable-async-append" xreflabel="enable_async_append">
+ <term><varname>enable_async_append</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_async_append</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of async-aware
+ append plan types. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
<term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..2ba4223915 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1563,6 +1563,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AppendReady</literal></entry>
+ <entry>Waiting for a subplan of Append to be ready.</entry>
+ </row>
<row>
<entry><literal>BackupWaitWalArchive</literal></entry>
<entry>Waiting for WAL files required for a backup to be successfully
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index afc45429ba..fe75cabdcc 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1394,6 +1394,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1413,6 +1415,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (custom_name)
ExplainPropertyText("Custom Plan Provider", custom_name, es);
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
+ ExplainPropertyBool("Async Capable", plan->async_capable, es);
}
switch (nodeTag(plan))
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 74ac59faa1..680fd69151 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 4543ac79ed..069c6ba948 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -531,6 +531,10 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e69de29bb2..e3d85ffabc 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,111 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ */
+void
+ExecAsyncRequest(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor event
+ * for which it wishes to wait. We expect the node-type specific callback to
+ * make a single call of the following form:
+ *
+ * AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, areq);
+ */
+void
+ExecAsyncConfigureWait(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+void
+ExecAsyncNotify(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+void
+ExecAsyncResponse(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * A requestee node should call this function to deliver the tuple to its
+ * requestor node. The node can call this from its ExecAsyncRequest callback
+ * if the requested tuple is available immediately.
+ */
+void
+ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result)
+{
+ areq->request_complete = true;
+ areq->result = result;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 15e4115bd6..123d5163de 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -57,10 +57,13 @@
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
/* Shared state for parallel-aware Append. */
struct ParallelAppendState
@@ -78,12 +81,18 @@ struct ParallelAppendState
};
#define INVALID_SUBPLAN_INDEX -1
+#define EVENT_BUFFER_SIZE 16
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
static void mark_invalid_subplans_as_finished(AppendState *node);
+static void ExecAppendAsyncBegin(AppendState *node);
+static bool ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result);
+static void ExecAppendAsyncEventWait(AppendState *node);
+static void classify_matching_subplans(AppendState *node);
/* ----------------------------------------------------------------
* ExecInitAppend
@@ -102,7 +111,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
AppendState *appendstate = makeNode(AppendState);
PlanState **appendplanstates;
Bitmapset *validsubplans;
+ Bitmapset *asyncplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
@@ -119,6 +130,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_syncdone = false;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -191,12 +203,24 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* While at it, find out the first valid partial plan.
*/
j = 0;
+ asyncplans = NULL;
+ nasyncplans = 0;
firstvalid = nplans;
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record async subplans. When executing EvalPlanQual, we process
+ * async subplans synchronously, so don't do this in that case.
+ */
+ if (initNode->async_capable && estate->es_epq_active == NULL)
+ {
+ asyncplans = bms_add_member(asyncplans, j);
+ nasyncplans++;
+ }
+
/*
* Record the lowest appendplans index which is a valid partial plan.
*/
@@ -210,6 +234,39 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* Initialize async state */
+ appendstate->as_asyncplans = asyncplans;
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_asyncrequests = NULL;
+ appendstate->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ appendstate->as_needrequest = NULL;
+ appendstate->as_eventset = NULL;
+
+ if (nasyncplans > 0)
+ {
+ appendstate->as_asyncrequests = (AsyncRequest **)
+ palloc0(nplans * sizeof(AsyncRequest *));
+
+ i = -1;
+ while ((i = bms_next_member(asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq;
+
+ areq = palloc(sizeof(AsyncRequest));
+ areq->requestor = (PlanState *) appendstate;
+ areq->requestee = appendplanstates[i];
+ areq->request_index = i;
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+
+ appendstate->as_asyncrequests[i] = areq;
+ }
+
+ classify_matching_subplans(appendstate);
+ }
+
/*
* Miscellaneous initialization
*/
@@ -232,31 +289,45 @@ static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
+ TupleTableSlot *result;
- if (node->as_whichplan < 0)
+ if (!node->as_syncdone && node->as_whichplan == INVALID_SUBPLAN_INDEX)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ /* If there are any async subplans, begin execution of them */
+ if (node->as_nasyncplans > 0)
+ ExecAppendAsyncBegin(node);
+
/*
- * If no subplan has been chosen, we must choose one before
+ * If no sync subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
- !node->choose_next_subplan(node))
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
{
PlanState *subnode;
- TupleTableSlot *result;
CHECK_FOR_INTERRUPTS();
/*
- * figure out which subplan we are currently processing
+ * try to get a tuple from any of the async subplans
+ */
+ if (!bms_is_empty(node->as_needrequest) ||
+ (node->as_syncdone && node->as_nasyncremain > 0))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
+ Assert(bms_is_empty(node->as_needrequest));
+ }
+
+ /*
+ * figure out which sync subplan we are currently processing
*/
Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
subnode = node->appendplans[node->as_whichplan];
@@ -276,8 +347,16 @@ ExecAppend(PlanState *pstate)
return result;
}
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /* wait or poll async events */
+ if (node->as_nasyncremain > 0)
+ {
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ ExecAppendAsyncEventWait(node);
+ }
+
+ /* choose new sync subplan; if no sync/async subplans, we're done */
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -313,6 +392,7 @@ ExecEndAppend(AppendState *node)
void
ExecReScanAppend(AppendState *node)
{
+ int nasyncplans = node->as_nasyncplans;
int i;
/*
@@ -326,6 +406,11 @@ ExecReScanAppend(AppendState *node)
{
bms_free(node->as_valid_subplans);
node->as_valid_subplans = NULL;
+ if (nasyncplans > 0)
+ {
+ bms_free(node->as_valid_asyncplans);
+ node->as_valid_asyncplans = NULL;
+ }
}
for (i = 0; i < node->as_nplans; i++)
@@ -347,8 +432,26 @@ ExecReScanAppend(AppendState *node)
ExecReScan(subnode);
}
+ /* Reset async state */
+ if (nasyncplans > 0)
+ {
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+ }
+
+ bms_free(node->as_needrequest);
+ node->as_needrequest = NULL;
+ }
+
/* Let choose_next_subplan_* function handle setting the first subplan */
node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_syncdone = false;
}
/* ----------------------------------------------------------------
@@ -429,7 +532,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
/* ----------------------------------------------------------------
* choose_next_subplan_locally
*
- * Choose next subplan for a non-parallel-aware Append,
+ * Choose next sync subplan for a non-parallel-aware Append,
* returning false if there are no more.
* ----------------------------------------------------------------
*/
@@ -444,9 +547,9 @@ choose_next_subplan_locally(AppendState *node)
/*
* If first call then have the bms member function choose the first valid
- * subplan by initializing whichplan to -1. If there happen to be no
- * valid subplans then the bms member function will handle that by
- * returning a negative number which will allow us to exit returning a
+ * sync subplan by initializing whichplan to -1. If there happen to be
+ * no valid sync subplans then the bms member function will handle that
+ * by returning a negative number which will allow us to exit returning a
* false value.
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
@@ -467,7 +570,10 @@ choose_next_subplan_locally(AppendState *node)
nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
if (nextplan < 0)
+ {
+ node->as_syncdone = true;
return false;
+ }
node->as_whichplan = nextplan;
@@ -709,3 +815,298 @@ mark_invalid_subplans_as_finished(AppendState *node)
node->as_pstate->pa_finished[i] = true;
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncBegin
+ *
+ * Begin execution of designated async-capable subplans.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncBegin(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+ int i;
+
+ /* We should never be called when there are no async subplans. */
+ Assert(node->as_nasyncplans > 0);
+
+ if (node->as_valid_subplans == NULL)
+ {
+ Assert(node->as_valid_asyncplans == NULL);
+
+ node->as_valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ classify_matching_subplans(node);
+ }
+
+ node->as_nasyncremain = 0;
+
+ /* Nothing to do if there are no valid async subplans. */
+ valid_asyncplans = node->as_valid_asyncplans;
+ if (valid_asyncplans == NULL)
+ return;
+
+ /* Make a request for each of the async subplans. */
+ i = -1;
+ while ((i = bms_next_member(valid_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(areq->request_index == i);
+ Assert(!areq->callback_pending);
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+
+ ++node->as_nasyncremain;
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncGetNext
+ *
+ * Get the next tuple from any of the asynchronous subplans.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
+{
+ *result = NULL;
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ while (node->as_nasyncremain > 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Wait or poll async events. */
+ ExecAppendAsyncEventWait(node);
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ /* Break from loop if there is any sync node that is not complete */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If all sync subplans are complete, we're totally done scanning the
+ * given node.  Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the sync subplans.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncremain == 0);
+ *result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncRequest
+ *
+ * If there are any asynchronous subplans that need a new asynchronous
+ * request, make all of them.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
+{
+ Bitmapset *needrequest;
+ int i;
+
+ /* Nothing to do if there are no async subplans needing a new request. */
+ if (bms_is_empty(node->as_needrequest))
+ return false;
+
+ /*
+ * If there are any asynchronously-generated results that have not yet
+ * been returned, we have nothing to do; just return one of them.
+ */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ /* Make a new request for each of the async subplans that need it. */
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ i = -1;
+ while ((i = bms_next_member(needrequest, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(needrequest);
+
+ /* Return one of the asynchronously-generated results if any. */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncEventWait
+ *
+ * Wait or poll for file descriptor wait events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncEventWait(AppendState *node)
+{
+ long timeout = node->as_syncdone ? -1 : 0;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+
+ /* Nothing to do if there are no remaining async subplans. */
+ if (node->as_nasyncremain == 0)
+ return;
+
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
+ node->as_nasyncplans + 1);
+ AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+ NULL, NULL);
+
+ /* Give each waiting subplan a chance to add an event. */
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ if (areq->callback_pending)
+ ExecAsyncConfigureWait(areq);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+ EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ FreeWaitEventSet(node->as_eventset);
+ node->as_eventset = NULL;
+ if (noccurred == 0)
+ return;
+
+ /* Deliver notifications. */
+ for (i = 0; i < noccurred; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ /*
+ * Each waiting subplan should have registered its wait event with
+ * user_data pointing back to its AsyncRequest.
+ */
+ if ((w->events & WL_SOCKET_READABLE) != 0)
+ {
+ AsyncRequest *areq = (AsyncRequest *) w->user_data;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ Assert(areq->callback_pending);
+ areq->callback_pending = false;
+
+ /* Do the actual work. */
+ ExecAsyncNotify(areq);
+ }
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(AsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot = areq->result;
+
+ /* The result should be a TupleTableSlot or NULL. */
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Nothing to do if the request is pending. */
+ if (!areq->request_complete)
+ {
+ /*
+ * The subplan for which the request was made would be pending for a
+ * callback.
+ */
+ Assert(areq->callback_pending);
+ return;
+ }
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ {
+ /* The ending subplan wouldn't have been pending for a callback. */
+ Assert(!areq->callback_pending);
+ --node->as_nasyncremain;
+ return;
+ }
+
+ /* Save result so we can return it */
+ Assert(node->as_nasyncresults < node->as_nasyncplans);
+ node->as_asyncresults[node->as_nasyncresults++] = slot;
+
+ /*
+ * Mark the subplan that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
+
+/* ----------------------------------------------------------------
+ * classify_matching_subplans
+ *
+ * Classify the node's as_valid_subplans into sync ones and
+ * async ones, adjust it to contain sync ones only, and save
+ * async ones in the node's as_valid_asyncplans
+ * ----------------------------------------------------------------
+ */
+static void
+classify_matching_subplans(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+
+ /* Nothing to do if there are no valid subplans. */
+ if (bms_is_empty(node->as_valid_subplans))
+ return;
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (!bms_overlap(node->as_valid_subplans, node->as_asyncplans))
+ return;
+
+ /* Get valid async subplans. */
+ valid_asyncplans = bms_copy(node->as_asyncplans);
+ valid_asyncplans = bms_int_members(valid_asyncplans,
+ node->as_valid_subplans);
+
+ /* Adjust the valid subplans to contain sync subplans only. */
+ node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
+ valid_asyncplans);
+
+ /* Save valid async subplans. */
+ node->as_valid_asyncplans = valid_asyncplans;
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0969e53c3a..898890fb08 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -391,3 +391,51 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Asynchronously request a tuple from a designated async-capable node
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Callback invoked when a relevant event has occurred
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index aaba1ec2c4..38aa9b5a85 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -120,6 +120,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -241,6 +242,7 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8fc432bfe1..a4bffb8e88 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -333,6 +333,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -431,6 +432,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 718fb58e86..03d01eea3e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1614,6 +1614,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1710,6 +1711,7 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a25b674a19..f3100f7540 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -147,6 +147,7 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
+bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 906cab7053..06774a9ec3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,6 +81,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
+static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1080,6 +1081,30 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
+
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1097,6 +1122,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1104,6 +1130,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1167,6 +1194,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
+ /* If appropriate, consider async append */
+ consider_async = (enable_async_append && pathkeys == NIL &&
+ !best_path->path.parallel_safe &&
+ list_length(best_path->subpaths) > 1);
+
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1234,6 +1266,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (consider_async && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ ++nasyncplans;
+ }
}
/*
@@ -1266,6 +1305,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
+ plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..58f8e0bbcf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3999,6 +3999,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
switch (w)
{
+ case WAIT_EVENT_APPEND_READY:
+ event_name = "AppendReady";
+ break;
case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
event_name = "BackupWaitWalArchive";
break;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 43a5fded10..5f3318fa8f 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -2020,6 +2020,15 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
}
#endif
+/*
+ * Get the number of wait events registered in a given WaitEventSet.
+ */
+int
+GetNumRegisteredWaitEvents(WaitEventSet *set)
+{
+ return set->nevents;
+}
+
#if defined(WAIT_USE_POLL)
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d626731723..9d252f2e75 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1111,6 +1111,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of async append plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_async_append,
+ true,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..740e4698a1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -371,6 +371,7 @@
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on
+#enable_async_append = on
# - Planner Cost Constants -
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
index e69de29bb2..93e8749476 100644
--- a/src/include/executor/execAsync.h
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(AsyncRequest *areq);
+extern void ExecAsyncConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncNotify(AsyncRequest *areq);
+extern void ExecAsyncResponse(AsyncRequest *areq);
+extern void ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index cafd410a5d..fa54ac6ad2 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -25,4 +25,6 @@ extern void ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt);
+extern void ExecAsyncAppendResponse(AsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 6ae7733e25..8ffc0ca5bf 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,4 +31,8 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(AsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncForeignScanNotify(AsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 248f78da45..7c89d081c7 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -178,6 +178,14 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+
+typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncConfigureWait_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncNotify_function) (AsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -256,6 +264,12 @@ typedef struct FdwRoutine
/* Support functions for path reparameterization. */
ReparameterizeForeignPathByChild_function ReparameterizeForeignPathByChild;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e31ad6204e..c93b9c011e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -515,6 +515,22 @@ typedef struct ResultRelInfo
struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
+/* ----------------
+ * AsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct AsyncRequest
+{
+ struct PlanState *requestor; /* Node that wants a tuple */
+ struct PlanState *requestee; /* Node from which a tuple is wanted */
+ int request_index; /* Scratch space for requestor */
+ bool callback_pending; /* Callback is needed */
+ bool request_complete; /* Request complete, result valid */
+ TupleTableSlot *result; /* Result (NULL if no more tuples) */
+} AsyncRequest;
+
/* ----------------
* EState information
*
@@ -1220,12 +1236,23 @@ struct AppendState
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
int as_whichplan;
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_asyncplans; /* asynchronous plans indexes */
+ int as_nasyncplans; /* # of asynchronous plans */
+ AsyncRequest **as_asyncrequests; /* array of AsyncRequests */
+ TupleTableSlot **as_asyncresults; /* unreturned results of async plans */
+ int as_nasyncresults; /* # of valid entries in as_asyncresults */
+ int as_nasyncremain; /* # of remaining async plans */
+ Bitmapset *as_needrequest; /* async plans ready for a request */
+ struct WaitEventSet *as_eventset; /* WaitEventSet used to configure
+ * file descriptor wait events */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_asyncplans;
bool (*choose_next_subplan) (AppendState *);
};
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6e62104d0b..24ca616740 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -129,6 +129,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -245,6 +250,7 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
+ int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 1be93be098..a3fd93fe07 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -65,6 +65,7 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
+extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..d9588da38a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -957,6 +957,7 @@ typedef enum
*/
typedef enum
{
+ WAIT_EVENT_APPEND_READY,
WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE = PG_WAIT_IPC,
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 9e94fcaec2..44f9368c64 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -179,5 +179,6 @@ extern int WaitLatch(Latch *latch, int wakeEvents, long timeout,
extern int WaitLatchOrSocket(Latch *latch, int wakeEvents,
pgsocket sock, long timeout, uint32 wait_event_info);
extern void InitializeLatchWaitSet(void);
+extern int GetNumRegisteredWaitEvents(WaitEventSet *set);
#endif /* LATCH_H */
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index dc7ab2ce8b..e78ca7bddb 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -87,6 +87,7 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -136,6 +137,7 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Plan> +
<Node-Type>Seq Scan</Node-Type> +
<Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
<Relation-Name>int8_tbl</Relation-Name> +
<Alias>i8</Alias> +
<Startup-Cost>N.N</Startup-Cost> +
@@ -183,6 +185,7 @@ select explain_filter('explain (analyze, buffers, format yaml) select * from int
- Plan: +
Node Type: "Seq Scan" +
Parallel Aware: false +
+ Async Capable: false +
Relation Name: "int8_tbl"+
Alias: "i8" +
Startup Cost: N.N +
@@ -233,6 +236,7 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -348,6 +352,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Relation Name": "tenk1", +
"Parallel Aware": true, +
"Local Hit Blocks": 0, +
@@ -393,6 +398,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Sort Space Used": 0, +
"Local Hit Blocks": 0, +
@@ -435,6 +441,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Workers Planned": 0, +
"Local Hit Blocks": 0, +
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 68ca321163..a417b566d9 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -558,6 +558,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 55, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
@@ -760,6 +761,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 70, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index ff157ceb1c..499245068a 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -204,6 +204,7 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "insertconflicttest", +
"Alias": "insertconflicttest", +
"Conflict Resolution": "UPDATE", +
@@ -213,7 +214,8 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
{ +
"Node Type": "Result", +
"Parent Relationship": "Member", +
- "Parallel Aware": false +
+ "Parallel Aware": false, +
+ "Async Capable": false +
} +
] +
} +
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 6d048e309c..98dde452e6 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -95,6 +95,7 @@ select count(*) = 0 as ok from pg_stat_wal_receiver;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
@@ -113,7 +114,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
On Mon, Mar 1, 2021 at 5:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Here is an updated version of the patch.
Another thing I'm concerned about in the postgres_fdw part is the case
where all/many postgres_fdw ForeignScans of an Append use the same
connection. In that case those ForeignScans are executed one by one,
not in parallel, so the overhead of async execution (i.e., doing
ExecAppendAsyncEventWait()) merely causes a performance degradation.
Here is such an example:
postgres=# create server loopback foreign data wrapper postgres_fdw
options (dbname 'postgres');
postgres=# create user mapping for current_user server loopback;
postgres=# create table pt (a int, b int, c text) partition by range (a);
postgres=# create table loct1 (a int, b int, c text);
postgres=# create table loct2 (a int, b int, c text);
postgres=# create table loct3 (a int, b int, c text);
postgres=# create foreign table p1 partition of pt for values from
(10) to (20) server loopback options (table_name 'loct1');
postgres=# create foreign table p2 partition of pt for values from
(20) to (30) server loopback options (table_name 'loct2');
postgres=# create foreign table p3 partition of pt for values from
(30) to (40) server loopback options (table_name 'loct3');
postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM00000')
from generate_series(0, 99999) i;
postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM00000')
from generate_series(0, 99999) i;
postgres=# insert into p3 select 30 + i % 10, i, to_char(i, 'FM00000')
from generate_series(0, 99999) i;
postgres=# analyze pt;
postgres=# set enable_async_append to off;
postgres=# select count(*) from pt;
count
--------
300000
(1 row)
Time: 366.905 ms
postgres=# set enable_async_append to on;
postgres=# select count(*) from pt;
count
--------
300000
(1 row)
Time: 385.431 ms
People would use postgres_fdw to access old partitions archived on a
single remote server, so the same degradation would likely occur in
such a use case. To avoid that, how about 1) adding table/server
options to postgres_fdw that allow/disallow async execution, and 2)
setting them to false by default?
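Under that proposal, a user would opt in per server or per table, for
example like this (the server/table names are made up for illustration;
the option name and OPTIONS syntax mirror existing postgres_fdw boolean
options such as 'updatable', and the attached patch allows
'async_capable' on both servers and foreign tables):

```sql
-- Opt in for all foreign tables of a server known to be a dedicated remote:
ALTER SERVER remote1 OPTIONS (ADD async_capable 'true');

-- Or opt in for a single foreign table, overriding the server setting:
ALTER FOREIGN TABLE p1 OPTIONS (ADD async_capable 'true');

-- Back out again if the remote turns out to be shared:
ALTER FOREIGN TABLE p1 OPTIONS (SET async_capable 'false');
```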
Best regards,
Etsuro Fujita
On Thu, Mar 4, 2021 at 1:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
To avoid that, how about 1) adding the
table/server options to postgres_fdw that allow/disallow async
execution, and 2) setting them to false by default?
There seem to be no objections, so I went ahead and added the
table/server option 'async_capable', set to false by default. Attached
is an updated patch.
Best regards,
Etsuro Fujita
Attachment: async-wip-2021-03-08.patch
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ee0b4acf0b..fe76b7cfd1 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -62,6 +62,7 @@ typedef struct ConnCacheEntry
Oid serverid; /* foreign server OID used to get server name */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ PgFdwConnState state; /* extra per-connection state */
} ConnCacheEntry;
/*
@@ -117,7 +118,7 @@ static bool disconnect_cached_connections(Oid serverid);
* (not even on error), we need this flag to cue manual cleanup.
*/
PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
{
bool found;
bool retry = false;
@@ -196,6 +197,9 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
*/
PG_TRY();
{
+ /* Process a pending asynchronous request if any. */
+ if (entry->state.pendingAreq)
+ process_pending_request(entry->state.pendingAreq);
/* Start a new transaction or subtransaction if needed. */
begin_remote_xact(entry);
}
@@ -264,6 +268,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Remember if caller will prepare statements */
entry->have_prep_stmt |= will_prep_stmt;
+ /* If caller needs access to the per-connection state, return it. */
+ if (state)
+ *state = &entry->state;
+
return entry->conn;
}
@@ -291,6 +299,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ memset(&entry->state, 0, sizeof(entry->state));
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
@@ -648,8 +657,12 @@ GetPrepStmtNumber(PGconn *conn)
* Caller is responsible for the error handling on the result.
*/
PGresult *
-pgfdw_exec_query(PGconn *conn, const char *query)
+pgfdw_exec_query(PGconn *conn, const char *query, PgFdwConnState *state)
{
+ /* First, process a pending asynchronous request, if any. */
+ if (state && state->pendingAreq)
+ process_pending_request(state->pendingAreq);
+
/*
* Submit a query. Since we don't use non-blocking mode, this also can
* block. But its risk is relatively small, so we ignore that for now.
@@ -940,6 +953,8 @@ pgfdw_xact_callback(XactEvent event, void *arg)
{
entry->have_prep_stmt = false;
entry->have_error = false;
+ /* Also reset per-connection state */
+ memset(&entry->state, 0, sizeof(entry->state));
}
/* Disarm changing_xact_state if it all worked. */
@@ -1172,6 +1187,10 @@ pgfdw_reject_incomplete_xact_state_change(ConnCacheEntry *entry)
* Cancel the currently-in-progress query (whose query text we do not have)
* and ignore the result. Returns true if we successfully cancel the query
* and discard any pending result, and false if not.
+ *
+ * XXX: if the query was one sent by fetch_more_data_begin(), we could get the
+ * query text from the pendingAreq saved in the per-connection state, then
+ * report the query using it.
*/
static bool
pgfdw_cancel_query(PGconn *conn)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0649b6b81c..126065ebf9 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -19,6 +19,7 @@ DO $d$
)$$;
END;
$d$;
+ALTER SERVER loopback OPTIONS (ADD async_capable 'true');
CREATE USER MAPPING FOR public SERVER testserver1
OPTIONS (user 'value', password 'value');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
@@ -7021,7 +7022,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -7049,7 +7050,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7077,7 +7078,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -7105,7 +7106,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -7133,7 +7134,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+----
(0 rows)
@@ -7175,35 +7176,40 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -7213,35 +7219,40 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar bar_1
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+ Sort Key: bar_1.f1
+ -> Seq Scan on public.bar bar_1
+ Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
-> Foreign Scan on public.bar2 bar_2
Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.f1, foo.*, foo.tableoid
Group Key: foo.f1
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -7273,7 +7284,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-> Hash Join
@@ -7291,7 +7302,7 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Append
-> Seq Scan on public.foo foo_1
Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
- -> Foreign Scan on public.foo2 foo_2
+ -> Async Foreign Scan on public.foo2 foo_2
Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
(39 rows)
@@ -7326,12 +7337,12 @@ where bar.f1 = ss.f1;
-> Append
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
+ -> Async Foreign Scan on public.foo2 foo_1
Output: ROW(foo_1.f1), foo_1.f1
Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
+ -> Async Foreign Scan on public.foo2 foo_3
Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
-> Hash
@@ -7353,12 +7364,12 @@ where bar.f1 = ss.f1;
-> Append
-> Seq Scan on public.foo
Output: ROW(foo.f1), foo.f1
- -> Foreign Scan on public.foo2 foo_1
+ -> Async Foreign Scan on public.foo2 foo_1
Output: ROW(foo_1.f1), foo_1.f1
Remote SQL: SELECT f1 FROM public.loct1
-> Seq Scan on public.foo foo_2
Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
- -> Foreign Scan on public.foo2 foo_3
+ -> Async Foreign Scan on public.foo2 foo_3
Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
(45 rows)
@@ -7511,27 +7522,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2 bar_1
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2 bar_1
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with t as (update bar set f2 = f2 + 100 returning *) select * from t order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: t.f1, t.f2
+ Sort Key: t.f1
+ CTE t
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2 bar_1
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2 bar_1
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on t
+ Output: t.f1, t.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with t as (update bar set f2 = f2 + 100 returning *) select * from t order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
@@ -8606,9 +8623,9 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
Sort
Sort Key: t1.a, t3.c
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
(7 rows)
@@ -8645,19 +8662,19 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
-- with whole-row reference; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------
Sort
Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-> Hash Full Join
Hash Cond: (t1.a = t2.b)
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
-> Hash
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
(11 rows)
SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
@@ -8687,9 +8704,9 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
Sort
Sort Key: t1.a, t1.b
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
(7 rows)
@@ -8744,20 +8761,20 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
-- test FOR UPDATE; partitionwise join does not apply
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
- QUERY PLAN
---------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------
LockRows
-> Sort
Sort Key: t1.a
-> Hash Join
Hash Cond: (t2.b = t1.a)
-> Append
- -> Foreign Scan on ftprt2_p1 t2_1
- -> Foreign Scan on ftprt2_p2 t2_2
+ -> Async Foreign Scan on ftprt2_p1 t2_1
+ -> Async Foreign Scan on ftprt2_p2 t2_2
-> Hash
-> Append
- -> Foreign Scan on ftprt1_p1 t1_1
- -> Foreign Scan on ftprt1_p2 t1_2
+ -> Async Foreign Scan on ftprt1_p1 t1_1
+ -> Async Foreign Scan on ftprt1_p2 t1_2
(12 rows)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
@@ -8793,17 +8810,17 @@ ANALYZE fpagg_tab_p3;
SET enable_partitionwise_aggregate TO false;
EXPLAIN (COSTS OFF)
SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------------------------------
Sort
Sort Key: pagg_tab.a
-> HashAggregate
Group Key: pagg_tab.a
Filter: (avg(pagg_tab.b) < '22'::numeric)
-> Append
- -> Foreign Scan on fpagg_tab_p1 pagg_tab_1
- -> Foreign Scan on fpagg_tab_p2 pagg_tab_2
- -> Foreign Scan on fpagg_tab_p3 pagg_tab_3
+ -> Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+ -> Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+ -> Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
(9 rows)
-- Plan with partitionwise aggregates is enabled
@@ -8815,11 +8832,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
Sort
Sort Key: pagg_tab.a
-> Append
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
- -> Foreign Scan
+ -> Async Foreign Scan
Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
(9 rows)
@@ -8946,7 +8963,7 @@ DO $d$
END;
$d$;
ERROR: invalid option "password"
-HINT: Valid options in this context are: service, passfile, channel_binding, connect_timeout, dbname, host, hostaddr, port, options, application_name, keepalives, keepalives_idle, keepalives_interval, keepalives_count, tcp_user_timeout, sslmode, sslcompression, sslcert, sslkey, sslrootcert, sslcrl, sslcrldir, requirepeer, ssl_min_protocol_version, ssl_max_protocol_version, gssencmode, krbsrvname, gsslib, target_session_attrs, use_remote_estimate, fdw_startup_cost, fdw_tuple_cost, extensions, updatable, fetch_size, batch_size
+HINT: Valid options in this context are: service, passfile, channel_binding, connect_timeout, dbname, host, hostaddr, port, options, application_name, keepalives, keepalives_idle, keepalives_interval, keepalives_count, tcp_user_timeout, sslmode, sslcompression, sslcert, sslkey, sslrootcert, sslcrl, sslcrldir, requirepeer, ssl_min_protocol_version, ssl_max_protocol_version, gssencmode, krbsrvname, gsslib, target_session_attrs, use_remote_estimate, fdw_startup_cost, fdw_tuple_cost, extensions, updatable, fetch_size, batch_size, async_capable
CONTEXT: SQL statement "ALTER SERVER loopback_nopw OPTIONS (ADD password 'dummypw')"
PL/pgSQL function inline_code_block line 3 at EXECUTE
-- If we add a password for our user mapping instead, we should get a different
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 64698c4da3..530d7a66d4 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -107,7 +107,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
* Validate option value, when we can do so without any context.
*/
if (strcmp(def->defname, "use_remote_estimate") == 0 ||
- strcmp(def->defname, "updatable") == 0)
+ strcmp(def->defname, "updatable") == 0 ||
+ strcmp(def->defname, "async_capable") == 0)
{
/* these accept only boolean values */
(void) defGetBoolean(def);
@@ -217,6 +218,9 @@ InitPgFdwOptions(void)
/* batch_size is available on both server and table */
{"batch_size", ForeignServerRelationId, false},
{"batch_size", ForeignTableRelationId, false},
+ /* async_capable is available on both server and table */
+ {"async_capable", ForeignServerRelationId, false},
+ {"async_capable", ForeignTableRelationId, false},
{"password_required", UserMappingRelationId, false},
/*
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 35b48575c5..1354190e42 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -37,6 +38,7 @@
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
#include "postgres_fdw.h"
+#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/float.h"
#include "utils/guc.h"
@@ -143,6 +145,7 @@ typedef struct PgFdwScanState
/* for remote query execution */
PGconn *conn; /* connection for the scan */
+ PgFdwConnState *conn_state; /* extra per-connection state */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -159,6 +162,9 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ /* for asynchronous execution */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -176,6 +182,7 @@ typedef struct PgFdwModifyState
/* for remote query execution */
PGconn *conn; /* connection for the scan */
+ PgFdwConnState *conn_state; /* extra per-connection state */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -219,6 +226,7 @@ typedef struct PgFdwDirectModifyState
/* for remote query execution */
PGconn *conn; /* connection for the update */
+ PgFdwConnState *conn_state; /* extra per-connection state */
int numParams; /* number of parameters passed to query */
FmgrInfo *param_flinfo; /* output conversion functions for them */
List *param_exprs; /* executable expressions for param values */
@@ -408,6 +416,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(AsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(AsyncRequest *areq);
+static void postgresForeignAsyncNotify(AsyncRequest *areq);
/*
* Helper functions
@@ -437,7 +449,8 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
void *arg);
static void create_cursor(ForeignScanState *node);
static void fetch_more_data(ForeignScanState *node);
-static void close_cursor(PGconn *conn, unsigned int cursor_number);
+static void close_cursor(PGconn *conn, unsigned int cursor_number,
+ PgFdwConnState *conn_state);
static PgFdwModifyState *create_foreign_modify(EState *estate,
RangeTblEntry *rte,
ResultRelInfo *resultRelInfo,
@@ -491,6 +504,8 @@ static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
double *totaldeadrows);
static void analyze_row_processor(PGresult *res, int row,
PgFdwAnalyzeState *astate);
+static void request_tuple_asynchronously(AsyncRequest *areq, bool fetch);
+static void fetch_more_data_begin(AsyncRequest *areq);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -583,6 +598,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for asynchronous execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -625,6 +646,7 @@ postgresGetForeignRelSize(PlannerInfo *root,
fpinfo->fdw_tuple_cost = DEFAULT_FDW_TUPLE_COST;
fpinfo->shippable_extensions = NIL;
fpinfo->fetch_size = 100;
+ fpinfo->async_capable = false;
apply_server_options(fpinfo);
apply_table_options(fpinfo);
@@ -1458,7 +1480,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
/* Assign a unique ID for my cursor */
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1509,6 +1531,9 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
+
+ /* Set the async-capable flag */
+ fsstate->async_capable = node->ss.ps.plan->async_capable;
}
/*
@@ -1523,8 +1548,10 @@ postgresIterateForeignScan(ForeignScanState *node)
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
+ * In sync mode, if this is the first call after Begin or ReScan, we need
+ * to create the cursor on the remote side. In async mode, we would have
+ * already created the cursor before we get here, even if this is the
+ * first call after Begin or ReScan.
*/
if (!fsstate->cursor_exists)
create_cursor(node);
@@ -1534,6 +1561,9 @@ postgresIterateForeignScan(ForeignScanState *node)
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
+ /* In async mode, just clear tuple slot. */
+ if (fsstate->async_capable)
+ return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
fetch_more_data(node);
@@ -1595,7 +1625,7 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->conn, sql, fsstate->conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
PQclear(res);
@@ -1623,7 +1653,8 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->conn, fsstate->cursor_number,
+ fsstate->conn_state);
/* Release remote connection */
ReleaseConnection(fsstate->conn);
@@ -2500,7 +2531,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->conn = GetConnection(user, false, &dmstate->conn_state);
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2881,7 +2912,7 @@ estimate_path_cost_size(PlannerInfo *root,
false, &retrieved_attrs, NULL);
/* Get the remote estimate */
- conn = GetConnection(fpinfo->user, false);
+ conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3327,7 +3358,7 @@ get_remote_estimate(const char *sql, PGconn *conn,
/*
* Execute EXPLAIN remotely.
*/
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_exec_query(conn, sql, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql);
@@ -3451,6 +3482,10 @@ create_cursor(ForeignScanState *node)
StringInfoData buf;
PGresult *res;
+ /* First, process a pending asynchronous request, if any. */
+ if (fsstate->conn_state->pendingAreq)
+ process_pending_request(fsstate->conn_state->pendingAreq);
+
/*
* Construct array of query parameter values in text format. We do the
* conversions in the short-lived per-tuple context, so as not to cause a
@@ -3531,17 +3566,38 @@ fetch_more_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->conn;
- char sql[64];
int numrows;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
+ if (fsstate->async_capable)
+ {
+ Assert(fsstate->conn_state->pendingAreq);
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
- pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ /*
+ * The query was already sent by an earlier call to
+ * fetch_more_data_begin. So now we just fetch the result.
+ */
+ res = pgfdw_get_result(conn, fsstate->query);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+
+ /* Reset per-connection state */
+ fsstate->conn_state->pendingAreq = NULL;
+ }
+ else
+ {
+ char sql[64];
+
+ /* This is a regular synchronous fetch. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ res = pgfdw_exec_query(conn, sql, fsstate->conn_state);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
/* Convert the data into HeapTuples */
numrows = PQntuples(res);
@@ -3633,7 +3689,8 @@ reset_transmission_modes(int nestlevel)
* Utility routine to close a cursor.
*/
static void
-close_cursor(PGconn *conn, unsigned int cursor_number)
+close_cursor(PGconn *conn, unsigned int cursor_number,
+ PgFdwConnState *conn_state)
{
char sql[64];
PGresult *res;
@@ -3644,7 +3701,7 @@ close_cursor(PGconn *conn, unsigned int cursor_number)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_exec_query(conn, sql, conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, true, sql);
PQclear(res);
@@ -3693,7 +3750,7 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->conn = GetConnection(user, true, &fmstate->conn_state);
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -3792,6 +3849,10 @@ execute_foreign_modify(EState *estate,
operation == CMD_UPDATE ||
operation == CMD_DELETE);
+ /* First, process a pending asynchronous request, if any. */
+ if (fmstate->conn_state->pendingAreq)
+ process_pending_request(fmstate->conn_state->pendingAreq);
+
/*
* If the existing query was deparsed and prepared for a different number
* of rows, rebuild it for the proper number.
@@ -3893,6 +3954,11 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
char *p_name;
PGresult *res;
+ /*
+ * The caller would already have processed a pending asynchronous request
+ * if any, so no need to do it here.
+ */
+
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
GetPrepStmtNumber(fmstate->conn));
@@ -4078,7 +4144,7 @@ deallocate_query(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->conn, sql, fmstate->conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
PQclear(res);
@@ -4226,6 +4292,10 @@ execute_dml_stmt(ForeignScanState *node)
int numParams = dmstate->numParams;
const char **values = dmstate->param_values;
+ /* First, process a pending asynchronous request, if any. */
+ if (dmstate->conn_state->pendingAreq)
+ process_pending_request(dmstate->conn_state->pendingAreq);
+
/*
* Construct array of query parameter values in text format.
*/
@@ -4627,7 +4697,7 @@ postgresAnalyzeForeignTable(Relation relation,
*/
table = GetForeignTable(RelationGetRelid(relation));
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct command to get page count for relation.
@@ -4638,7 +4708,7 @@ postgresAnalyzeForeignTable(Relation relation,
/* In what follows, do not risk leaking any PGresults. */
PG_TRY();
{
- res = pgfdw_exec_query(conn, sql.data);
+ res = pgfdw_exec_query(conn, sql.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -4713,7 +4783,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
table = GetForeignTable(RelationGetRelid(relation));
server = GetForeignServer(table->serverid);
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct cursor that retrieves whole rows from remote.
@@ -4730,7 +4800,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
int fetch_size;
ListCell *lc;
- res = pgfdw_exec_query(conn, sql.data);
+ res = pgfdw_exec_query(conn, sql.data, NULL);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
PQclear(res);
@@ -4782,7 +4852,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
*/
/* Fetch some rows */
- res = pgfdw_exec_query(conn, fetch_sql);
+ res = pgfdw_exec_query(conn, fetch_sql, NULL);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -4801,7 +4871,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
}
/* Close the cursor, just to be tidy. */
- close_cursor(conn, cursor_number);
+ close_cursor(conn, cursor_number, NULL);
}
PG_CATCH();
{
@@ -4941,7 +5011,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
*/
server = GetForeignServer(serverOid);
mapping = GetUserMapping(GetUserId(), server->serverid);
- conn = GetConnection(mapping, false);
+ conn = GetConnection(mapping, false, NULL);
/* Don't attempt to import collation if remote server hasn't got it */
if (PQserverVersion(conn) < 90100)
@@ -4957,7 +5027,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, "SELECT 1 FROM pg_catalog.pg_namespace WHERE nspname = ");
deparseStringLiteral(&buf, stmt->remote_schema);
- res = pgfdw_exec_query(conn, buf.data);
+ res = pgfdw_exec_query(conn, buf.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -5069,7 +5139,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, " ORDER BY c.relname, a.attnum");
/* Fetch the data */
- res = pgfdw_exec_query(conn, buf.data);
+ res = pgfdw_exec_query(conn, buf.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -5529,6 +5599,8 @@ apply_server_options(PgFdwRelationInfo *fpinfo)
ExtractExtensionList(defGetString(def), false);
else if (strcmp(def->defname, "fetch_size") == 0)
fpinfo->fetch_size = strtol(defGetString(def), NULL, 10);
+ else if (strcmp(def->defname, "async_capable") == 0)
+ fpinfo->async_capable = defGetBoolean(def);
}
}
@@ -5550,6 +5622,8 @@ apply_table_options(PgFdwRelationInfo *fpinfo)
fpinfo->use_remote_estimate = defGetBoolean(def);
else if (strcmp(def->defname, "fetch_size") == 0)
fpinfo->fetch_size = strtol(defGetString(def), NULL, 10);
+ else if (strcmp(def->defname, "async_capable") == 0)
+ fpinfo->async_capable = defGetBoolean(def);
}
}
@@ -5584,6 +5658,7 @@ merge_fdw_options(PgFdwRelationInfo *fpinfo,
fpinfo->shippable_extensions = fpinfo_o->shippable_extensions;
fpinfo->use_remote_estimate = fpinfo_o->use_remote_estimate;
fpinfo->fetch_size = fpinfo_o->fetch_size;
+ fpinfo->async_capable = fpinfo_o->async_capable;
/* Merge the table level options from either side of the join. */
if (fpinfo_i)
@@ -5605,6 +5680,13 @@ merge_fdw_options(PgFdwRelationInfo *fpinfo,
* relation sizes.
*/
fpinfo->fetch_size = Max(fpinfo_o->fetch_size, fpinfo_i->fetch_size);
+
+ /*
+ * We'll prefer to consider this join async-capable if any table from
+ * either side of the join is considered async-capable.
+ */
+ fpinfo->async_capable = fpinfo_o->async_capable ||
+ fpinfo_i->async_capable;
}
}
@@ -6488,6 +6570,214 @@ add_foreign_final_paths(PlannerInfo *root, RelOptInfo *input_rel,
add_path(final_rel, (Path *) final_path);
}
+/*
+ * postgresIsForeignPathAsyncCapable
+ * Check whether a given ForeignPath node is async-capable.
+ */
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ RelOptInfo *rel = ((Path *) path)->parent;
+ PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) rel->fdw_private;
+
+ return fpinfo->async_capable;
+}
+
+/*
+ * postgresForeignAsyncRequest
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncRequest(AsyncRequest *areq)
+{
+ request_tuple_asynchronously(areq, true);
+}
+
+/*
+ * postgresForeignAsyncConfigureWait
+ * Configure a file descriptor event for which we wish to wait.
+ */
+static void
+postgresForeignAsyncConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AsyncRequest *pendingAreq = fsstate->conn_state->pendingAreq;
+ AppendState *requestor = (AppendState *) areq->requestor;
+ WaitEventSet *set = requestor->as_eventset;
+
+ /* This should not be called unless callback_pending */
+ Assert(areq->callback_pending);
+
+ /* The core code would have registered the postmaster death event */
+ Assert(GetNumRegisteredWaitEvents(set) >= 1);
+
+ /* Begin an asynchronous data fetch if necessary */
+ if (!pendingAreq)
+ fetch_more_data_begin(areq);
+ else if (pendingAreq->requestor != areq->requestor)
+ {
+ if (GetNumRegisteredWaitEvents(set) > 1)
+ return;
+ process_pending_request(pendingAreq);
+ fetch_more_data_begin(areq);
+ }
+ else if (pendingAreq->requestee != areq->requestee)
+ return;
+ else
+ Assert(pendingAreq == areq);
+
+ AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
+ NULL, areq);
+}
+
+/*
+ * postgresForeignAsyncNotify
+ * Fetch some more tuples from a file descriptor that became ready, and
+ * request the next tuple.
+ */
+static void
+postgresForeignAsyncNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+
+ /* The core code would have initialized the callback_pending flag */
+ Assert(!areq->callback_pending);
+
+ /* The request should be currently in-process */
+ Assert(fsstate->conn_state->pendingAreq == areq);
+
+ /* On error, report the original query, not the FETCH. */
+ if (!PQconsumeInput(fsstate->conn))
+ pgfdw_report_error(ERROR, NULL, fsstate->conn, false, fsstate->query);
+
+ fetch_more_data(node);
+
+ request_tuple_asynchronously(areq, true);
+}
+
+/*
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+request_tuple_asynchronously(AsyncRequest *areq, bool fetch)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AsyncRequest *pendingAreq = fsstate->conn_state->pendingAreq;
+ TupleTableSlot *result;
+
+ /* This should not be called if the request is currently in-process */
+ Assert(areq != pendingAreq);
+
+ /* Request some more tuples, if we've run out */
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ /* No point in another fetch if we already detected EOF, though */
+ if (!fsstate->eof_reached)
+ {
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ /* Begin another fetch if requested and if no pending request */
+ if (fetch && !pendingAreq)
+ fetch_more_data_begin(areq);
+ }
+ else
+ {
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ }
+ return;
+ }
+
+ /* Get a tuple from the ForeignScan node */
+ result = ExecProcNode((PlanState *) node);
+ if (!TupIsNull(result))
+ {
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ return;
+ }
+ Assert(fsstate->next_tuple >= fsstate->num_tuples);
+
+ /* Request some more tuples, if we've not detected EOF yet */
+ if (!fsstate->eof_reached)
+ {
+ /* Mark the request as needing a callback */
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ /* Begin another fetch if requested and if no pending request */
+ if (fetch && !pendingAreq)
+ fetch_more_data_begin(areq);
+ }
+ else
+ {
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ }
+}
+
+/*
+ * Begin an asynchronous data fetch.
+ *
+ * Note: fetch_more_data must be called to fetch the result.
+ */
+static void
+fetch_more_data_begin(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ char sql[64];
+
+ Assert(!fsstate->conn_state->pendingAreq);
+
+ /* Create the cursor synchronously. */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ /* We will send this query, but not wait for the response. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(fsstate->conn, sql))
+ pgfdw_report_error(ERROR, NULL, fsstate->conn, false, fsstate->query);
+
+ /* Remember that the request is in process */
+ fsstate->conn_state->pendingAreq = areq;
+}
+
+/*
+ * Process a pending asynchronous request.
+ */
+void
+process_pending_request(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ EState *estate = node->ss.ps.state;
+ MemoryContext oldcontext;
+
+ /* The request should be currently in-process */
+ Assert(fsstate->conn_state->pendingAreq == areq);
+
+ oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+
+ fetch_more_data(node);
+
+ request_tuple_asynchronously(areq, false);
+
+ /* Unlike AsyncRequest/AsyncNotify, we call ExecAsyncResponse ourselves */
+ ExecAsyncResponse(areq);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 1f67b4d9fd..88d94da6f6 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,6 +16,7 @@
#include "foreign/foreign.h"
#include "lib/stringinfo.h"
#include "libpq-fe.h"
+#include "nodes/execnodes.h"
#include "nodes/pathnodes.h"
#include "utils/relcache.h"
@@ -78,6 +79,7 @@ typedef struct PgFdwRelationInfo
Cost fdw_startup_cost;
Cost fdw_tuple_cost;
List *shippable_extensions; /* OIDs of shippable extensions */
+ bool async_capable;
/* Cached catalog information. */
ForeignTable *table;
@@ -124,17 +126,28 @@ typedef struct PgFdwRelationInfo
int relation_index;
} PgFdwRelationInfo;
+/*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+ AsyncRequest *pendingAreq; /* pending async request */
+} PgFdwConnState;
+
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
+extern void process_pending_request(AsyncRequest *areq);
/* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+ PgFdwConnState **state);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
extern PGresult *pgfdw_get_result(PGconn *conn, const char *query);
-extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query);
+extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query,
+ PgFdwConnState *state);
extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
bool clear, const char *sql);
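Since several scan or modify states can share one connection, the new PgFdwConnState carries a pendingAreq pointer that allows at most one asynchronous FETCH in flight per connection; every path that wants to run a synchronous query (create_cursor, execute_foreign_modify, execute_dml_stmt, ...) drains that pending request first. A rough model of the invariant — the names here are illustrative, not the patch's real API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the single-pending-request rule enforced by
 * PgFdwConnState.pendingAreq. */
typedef struct Request
{
	int			id;
	int			complete;
} Request;

typedef struct ConnState
{
	Request    *pendingAreq;	/* at most one request in flight */
} ConnState;

/* Like fetch_more_data_begin: send without waiting, remember the request. */
static void
begin_fetch(ConnState *cs, Request *req)
{
	assert(cs->pendingAreq == NULL);
	cs->pendingAreq = req;
}

/* Like process_pending_request: drain the in-flight result. */
static void
process_pending(ConnState *cs)
{
	cs->pendingAreq->complete = 1;
	cs->pendingAreq = NULL;
}

/* Like pgfdw_exec_query with a conn_state: a synchronous query must first
 * complete whatever asynchronous FETCH is still in flight. */
static void
exec_query(ConnState *cs)
{
	if (cs->pendingAreq)
		process_pending(cs);
	/* ... the connection is now free for the new query ... */
}
```

This is why GetConnection and pgfdw_exec_query grow the extra state argument above: callers that pass NULL (ANALYZE, IMPORT, remote estimates) never share the connection with an async scan, so they have nothing to drain.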
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 2b525ea44a..320844be02 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -22,6 +22,8 @@ DO $d$
END;
$d$;
+ALTER SERVER loopback OPTIONS (ADD async_capable 'true');
+
CREATE USER MAPPING FOR public SERVER testserver1
OPTIONS (user 'value', password 'value');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
@@ -1822,31 +1824,31 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
DELETE FROM a;
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1882,12 +1884,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1946,8 +1948,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with t as (update bar set f2 = f2 + 100 returning *) select * from t order by 1;
+with t as (update bar set f2 = f2 + 100 returning *) select * from t order by 1;
-- Test that UPDATE/DELETE with inherited target works with row-level triggers
CREATE TRIGGER trig_row_before
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 967de73596..dc2a0d0987 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4781,6 +4781,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<variablelist>
+ <varlistentry id="guc-enable-async-append" xreflabel="enable_async_append">
+ <term><varname>enable_async_append</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_async_append</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of async-aware
+ append plan types. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
<term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..2ba4223915 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1563,6 +1563,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AppendReady</literal></entry>
+ <entry>Waiting for a subplan of Append to be ready.</entry>
+ </row>
<row>
<entry><literal>BackupWaitWalArchive</literal></entry>
<entry>Waiting for WAL files required for a backup to be successfully
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 07aa25799d..153ff08d91 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -371,6 +371,34 @@ OPTIONS (ADD password_required 'false');
</sect3>
+ <sect3>
+ <title>Asynchronous Execution Options</title>
+
+ <para>
+ <filename>postgres_fdw</filename> supports asynchronous execution that
+ runs multiple subplan nodes of an <structname>Append</structname> plan
+ node concurrently rather than serially to improve query performance.
+ This execution can be controlled using the following option:
+ </para>
+
+ <variablelist>
+
+ <varlistentry>
+ <term><literal>async_capable</literal></term>
+ <listitem>
+ <para>
+ This option controls whether <filename>postgres_fdw</filename> allows
+ foreign tables to be scanned concurrently for asynchronous execution.
+ It can be specified for a foreign table or a foreign server.
+ A table-level option overrides a server-level option.
+ The default is <literal>false</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect3>
+
<sect3>
<title>Updatability Options</title>
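For reference, enabling the option server-wide and overriding it for one table might look like this (the `loopback` server name comes from the regression test changes above; the table name is hypothetical):

```sql
-- Allow asynchronous scans for all foreign tables on this server.
ALTER SERVER loopback OPTIONS (ADD async_capable 'true');

-- Opt a single (hypothetical) table back out; the table-level
-- setting overrides the server-level one.
ALTER FOREIGN TABLE remote_tab OPTIONS (ADD async_capable 'false');
```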
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index afc45429ba..fe75cabdcc 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1394,6 +1394,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1413,6 +1415,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (custom_name)
ExplainPropertyText("Custom Plan Provider", custom_name, es);
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
+ ExplainPropertyBool("Async Capable", plan->async_capable, es);
}
switch (nodeTag(plan))
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 74ac59faa1..680fd69151 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 4543ac79ed..069c6ba948 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -531,6 +531,10 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e69de29bb2..e3d85ffabc 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,111 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ */
+void
+ExecAsyncRequest(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor event
+ * for which it wishes to wait. We expect the node-type specific callback to
+ * make a single call of the following form:
+ *
+ * AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, areq);
+ */
+void
+ExecAsyncConfigureWait(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+void
+ExecAsyncNotify(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+void
+ExecAsyncResponse(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * A requestee node should call this function to deliver the tuple to its
+ * requestor node. The node can call this from its ExecAsyncRequest callback
+ * if the requested tuple is available immediately.
+ */
+void
+ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result)
+{
+ areq->request_complete = true;
+ areq->result = result;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 15e4115bd6..123d5163de 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -57,10 +57,13 @@
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
/* Shared state for parallel-aware Append. */
struct ParallelAppendState
@@ -78,12 +81,18 @@ struct ParallelAppendState
};
#define INVALID_SUBPLAN_INDEX -1
+#define EVENT_BUFFER_SIZE 16
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
static void mark_invalid_subplans_as_finished(AppendState *node);
+static void ExecAppendAsyncBegin(AppendState *node);
+static bool ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result);
+static void ExecAppendAsyncEventWait(AppendState *node);
+static void classify_matching_subplans(AppendState *node);
/* ----------------------------------------------------------------
* ExecInitAppend
@@ -102,7 +111,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
AppendState *appendstate = makeNode(AppendState);
PlanState **appendplanstates;
Bitmapset *validsubplans;
+ Bitmapset *asyncplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
@@ -119,6 +130,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_syncdone = false;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -191,12 +203,24 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* While at it, find out the first valid partial plan.
*/
j = 0;
+ asyncplans = NULL;
+ nasyncplans = 0;
firstvalid = nplans;
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record async subplans. When executing EvalPlanQual, we process
+ * async subplans synchronously, so don't do this in that case.
+ */
+ if (initNode->async_capable && estate->es_epq_active == NULL)
+ {
+ asyncplans = bms_add_member(asyncplans, j);
+ nasyncplans++;
+ }
+
/*
* Record the lowest appendplans index which is a valid partial plan.
*/
@@ -210,6 +234,39 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* Initialize async state */
+ appendstate->as_asyncplans = asyncplans;
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_asyncrequests = NULL;
+ appendstate->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ appendstate->as_needrequest = NULL;
+ appendstate->as_eventset = NULL;
+
+ if (nasyncplans > 0)
+ {
+ appendstate->as_asyncrequests = (AsyncRequest **)
+ palloc0(nplans * sizeof(AsyncRequest *));
+
+ i = -1;
+ while ((i = bms_next_member(asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq;
+
+ areq = palloc(sizeof(AsyncRequest));
+ areq->requestor = (PlanState *) appendstate;
+ areq->requestee = appendplanstates[i];
+ areq->request_index = i;
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+
+ appendstate->as_asyncrequests[i] = areq;
+ }
+
+ classify_matching_subplans(appendstate);
+ }
+
/*
* Miscellaneous initialization
*/
@@ -232,31 +289,45 @@ static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
+ TupleTableSlot *result;
- if (node->as_whichplan < 0)
+ if (!node->as_syncdone && node->as_whichplan == INVALID_SUBPLAN_INDEX)
{
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ /* If there are any async subplans, begin execution of them */
+ if (node->as_nasyncplans > 0)
+ ExecAppendAsyncBegin(node);
+
/*
- * If no subplan has been chosen, we must choose one before
+ * If no sync subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
- !node->choose_next_subplan(node))
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
{
PlanState *subnode;
- TupleTableSlot *result;
CHECK_FOR_INTERRUPTS();
/*
- * figure out which subplan we are currently processing
+ * try to get a tuple from any of the async subplans
+ */
+ if (!bms_is_empty(node->as_needrequest) ||
+ (node->as_syncdone && node->as_nasyncremain > 0))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
+ Assert(bms_is_empty(node->as_needrequest));
+ }
+
+ /*
+ * figure out which sync subplan we are currently processing
*/
Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
subnode = node->appendplans[node->as_whichplan];
@@ -276,8 +347,16 @@ ExecAppend(PlanState *pstate)
return result;
}
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /* wait or poll async events */
+ if (node->as_nasyncremain > 0)
+ {
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ ExecAppendAsyncEventWait(node);
+ }
+
+ /* choose new sync subplan; if no sync/async subplans, we're done */
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -313,6 +392,7 @@ ExecEndAppend(AppendState *node)
void
ExecReScanAppend(AppendState *node)
{
+ int nasyncplans = node->as_nasyncplans;
int i;
/*
@@ -326,6 +406,11 @@ ExecReScanAppend(AppendState *node)
{
bms_free(node->as_valid_subplans);
node->as_valid_subplans = NULL;
+ if (nasyncplans > 0)
+ {
+ bms_free(node->as_valid_asyncplans);
+ node->as_valid_asyncplans = NULL;
+ }
}
for (i = 0; i < node->as_nplans; i++)
@@ -347,8 +432,26 @@ ExecReScanAppend(AppendState *node)
ExecReScan(subnode);
}
+ /* Reset async state */
+ if (nasyncplans > 0)
+ {
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+ }
+
+ bms_free(node->as_needrequest);
+ node->as_needrequest = NULL;
+ }
+
/* Let choose_next_subplan_* function handle setting the first subplan */
node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_syncdone = false;
}
/* ----------------------------------------------------------------
@@ -429,7 +532,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
/* ----------------------------------------------------------------
* choose_next_subplan_locally
*
- * Choose next subplan for a non-parallel-aware Append,
+ * Choose next sync subplan for a non-parallel-aware Append,
* returning false if there are no more.
* ----------------------------------------------------------------
*/
@@ -444,9 +547,9 @@ choose_next_subplan_locally(AppendState *node)
/*
* If first call then have the bms member function choose the first valid
- * subplan by initializing whichplan to -1. If there happen to be no
- * valid subplans then the bms member function will handle that by
- * returning a negative number which will allow us to exit returning a
+ * sync subplan by initializing whichplan to -1. If there happen to be
+ * no valid sync subplans then the bms member function will handle that
+ * by returning a negative number which will allow us to exit returning a
* false value.
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
@@ -467,7 +570,10 @@ choose_next_subplan_locally(AppendState *node)
nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
if (nextplan < 0)
+ {
+ node->as_syncdone = true;
return false;
+ }
node->as_whichplan = nextplan;
@@ -709,3 +815,298 @@ mark_invalid_subplans_as_finished(AppendState *node)
node->as_pstate->pa_finished[i] = true;
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncBegin
+ *
+ * Begin execution of designated async-capable subplans.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncBegin(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+ int i;
+
+ /* We should never be called when there are no async subplans. */
+ Assert(node->as_nasyncplans > 0);
+
+ if (node->as_valid_subplans == NULL)
+ {
+ Assert(node->as_valid_asyncplans == NULL);
+
+ node->as_valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ classify_matching_subplans(node);
+ }
+
+ node->as_nasyncremain = 0;
+
+ /* Nothing to do if there are no valid async subplans. */
+ valid_asyncplans = node->as_valid_asyncplans;
+ if (valid_asyncplans == NULL)
+ return;
+
+ /* Make a request for each of the async subplans. */
+ i = -1;
+ while ((i = bms_next_member(valid_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(areq->request_index == i);
+ Assert(!areq->callback_pending);
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+
+ ++node->as_nasyncremain;
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncGetNext
+ *
+ * Get the next tuple from any of the asynchronous subplans.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
+{
+ *result = NULL;
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ while (node->as_nasyncremain > 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Wait or poll async events. */
+ ExecAppendAsyncEventWait(node);
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ /* Break from loop if there is any sync node that is not complete */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If all sync subplans are complete, we're totally done scanning the
+ * given node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the sync subplans.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncremain == 0);
+ *result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncRequest
+ *
+ * If there are any asynchronous subplans that need a new asynchronous
+ * request, make all of them.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
+{
+ Bitmapset *needrequest;
+ int i;
+
+ /* Nothing to do if there are no async subplans needing a new request. */
+ if (bms_is_empty(node->as_needrequest))
+ return false;
+
+ /*
+ * If there are any asynchronously-generated results that have not yet
+ * been returned, we needn't make new requests; just return one of
+ * them.
+ */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ /* Make a new request for each of the async subplans that need it. */
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ i = -1;
+ while ((i = bms_next_member(needrequest, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(needrequest);
+
+ /* Return one of the asynchronously-generated results if any. */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncEventWait
+ *
+ * Wait or poll for file descriptor wait events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncEventWait(AppendState *node)
+{
+ long timeout = node->as_syncdone ? -1 : 0;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+
+ /* Nothing to do if there are no remaining async subplans. */
+ if (node->as_nasyncremain == 0)
+ return;
+
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
+ node->as_nasyncplans + 1);
+ AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+ NULL, NULL);
+
+ /* Give each waiting subplan a chance to add an event. */
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ if (areq->callback_pending)
+ ExecAsyncConfigureWait(areq);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+ EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ FreeWaitEventSet(node->as_eventset);
+ node->as_eventset = NULL;
+ if (noccurred == 0)
+ return;
+
+ /* Deliver notifications. */
+ for (i = 0; i < noccurred; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ /*
+ * Each waiting subplan should have registered its wait event with
+ * user_data pointing back to its AsyncRequest.
+ */
+ if ((w->events & WL_SOCKET_READABLE) != 0)
+ {
+ AsyncRequest *areq = (AsyncRequest *) w->user_data;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ Assert(areq->callback_pending);
+ areq->callback_pending = false;
+
+ /* Do the actual work. */
+ ExecAsyncNotify(areq);
+ }
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(AsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot = areq->result;
+
+ /* The result should be a TupleTableSlot or NULL. */
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Nothing to do if the request is pending. */
+ if (!areq->request_complete)
+ {
+ /*
+ * The subplan for which the request was made should still be
+ * pending for a callback.
+ */
+ Assert(areq->callback_pending);
+ return;
+ }
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ {
+ /* The ending subplan wouldn't have been pending for a callback. */
+ Assert(!areq->callback_pending);
+ --node->as_nasyncremain;
+ return;
+ }
+
+ /* Save result so we can return it */
+ Assert(node->as_nasyncresults < node->as_nasyncplans);
+ node->as_asyncresults[node->as_nasyncresults++] = slot;
+
+ /*
+ * Mark the subplan that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
+
+/* ----------------------------------------------------------------
+ * classify_matching_subplans
+ *
+ * Classify the node's as_valid_subplans into sync ones and
+ * async ones, adjust it to contain sync ones only, and save
+ * async ones in the node's as_valid_asyncplans
+ * ----------------------------------------------------------------
+ */
+static void
+classify_matching_subplans(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+
+ /* Nothing to do if there are no valid subplans. */
+ if (bms_is_empty(node->as_valid_subplans))
+ return;
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (!bms_overlap(node->as_valid_subplans, node->as_asyncplans))
+ return;
+
+ /* Get valid async subplans. */
+ valid_asyncplans = bms_copy(node->as_asyncplans);
+ valid_asyncplans = bms_int_members(valid_asyncplans,
+ node->as_valid_subplans);
+
+ /* Adjust the valid subplans to contain sync subplans only. */
+ node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
+ valid_asyncplans);
+
+ /* Save valid async subplans. */
+ node->as_valid_asyncplans = valid_asyncplans;
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0969e53c3a..898890fb08 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -391,3 +391,51 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Asynchronously request a tuple from a designated async-capable node
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Callback invoked when a relevant event has occurred
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index aaba1ec2c4..38aa9b5a85 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -120,6 +120,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -241,6 +242,7 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8fc432bfe1..a4bffb8e88 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -333,6 +333,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -431,6 +432,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 718fb58e86..03d01eea3e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1614,6 +1614,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1710,6 +1711,7 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a25b674a19..f3100f7540 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -147,6 +147,7 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
+bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 906cab7053..06774a9ec3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,6 +81,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
+static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1080,6 +1081,30 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ break;
+ default:
+ break;
+ }
+ return false;
+}
+
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1097,6 +1122,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1104,6 +1130,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1167,6 +1194,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
+ /* If appropriate, consider async append */
+ consider_async = (enable_async_append && pathkeys == NIL &&
+ !best_path->path.parallel_safe &&
+ list_length(best_path->subpaths) > 1);
+
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1234,6 +1266,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (consider_async && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ ++nasyncplans;
+ }
}
/*
@@ -1266,6 +1305,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
+ plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..58f8e0bbcf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3999,6 +3999,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
switch (w)
{
+ case WAIT_EVENT_APPEND_READY:
+ event_name = "AppendReady";
+ break;
case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
event_name = "BackupWaitWalArchive";
break;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 43a5fded10..5f3318fa8f 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -2020,6 +2020,15 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
}
#endif
+/*
+ * Get the number of wait events registered in a given WaitEventSet.
+ */
+int
+GetNumRegisteredWaitEvents(WaitEventSet *set)
+{
+ return set->nevents;
+}
+
#if defined(WAIT_USE_POLL)
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3fd1a5fbe2..07433aab83 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1111,6 +1111,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of async append plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_async_append,
+ true,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..740e4698a1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -371,6 +371,7 @@
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on
+#enable_async_append = on
# - Planner Cost Constants -
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
index e69de29bb2..93e8749476 100644
--- a/src/include/executor/execAsync.h
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(AsyncRequest *areq);
+extern void ExecAsyncConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncNotify(AsyncRequest *areq);
+extern void ExecAsyncResponse(AsyncRequest *areq);
+extern void ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index cafd410a5d..fa54ac6ad2 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -25,4 +25,6 @@ extern void ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt);
+extern void ExecAsyncAppendResponse(AsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 6ae7733e25..8ffc0ca5bf 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,4 +31,8 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(AsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncForeignScanNotify(AsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 248f78da45..7c89d081c7 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -178,6 +178,14 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+
+typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncConfigureWait_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncNotify_function) (AsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -256,6 +264,12 @@ typedef struct FdwRoutine
/* Support functions for path reparameterization. */
ReparameterizeForeignPathByChild_function ReparameterizeForeignPathByChild;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e31ad6204e..c93b9c011e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -515,6 +515,22 @@ typedef struct ResultRelInfo
struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
+/* ----------------
+ * AsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct AsyncRequest
+{
+ struct PlanState *requestor; /* Node that wants a tuple */
+ struct PlanState *requestee; /* Node from which a tuple is wanted */
+ int request_index; /* Scratch space for requestor */
+ bool callback_pending; /* Callback is needed */
+ bool request_complete; /* Request complete, result valid */
+ TupleTableSlot *result; /* Result (NULL if no more tuples) */
+} AsyncRequest;
+
/* ----------------
* EState information
*
@@ -1220,12 +1236,23 @@ struct AppendState
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
int as_whichplan;
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_asyncplans; /* asynchronous plans indexes */
+ int as_nasyncplans; /* # of asynchronous plans */
+ AsyncRequest **as_asyncrequests; /* array of AsyncRequests */
+ TupleTableSlot **as_asyncresults; /* unreturned results of async plans */
+ int as_nasyncresults; /* # of valid entries in as_asyncresults */
+ int as_nasyncremain; /* # of remaining async plans */
+ Bitmapset *as_needrequest; /* async plans ready for a request */
+ struct WaitEventSet *as_eventset; /* WaitEventSet used to configure
+ * file descriptor wait events */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_asyncplans;
bool (*choose_next_subplan) (AppendState *);
};
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6e62104d0b..24ca616740 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -129,6 +129,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -245,6 +250,7 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
+ int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 1be93be098..a3fd93fe07 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -65,6 +65,7 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
+extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..d9588da38a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -957,6 +957,7 @@ typedef enum
*/
typedef enum
{
+ WAIT_EVENT_APPEND_READY,
WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE = PG_WAIT_IPC,
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 9e94fcaec2..44f9368c64 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -179,5 +179,6 @@ extern int WaitLatch(Latch *latch, int wakeEvents, long timeout,
extern int WaitLatchOrSocket(Latch *latch, int wakeEvents,
pgsocket sock, long timeout, uint32 wait_event_info);
extern void InitializeLatchWaitSet(void);
+extern int GetNumRegisteredWaitEvents(WaitEventSet *set);
#endif /* LATCH_H */
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index dc7ab2ce8b..e78ca7bddb 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -87,6 +87,7 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -136,6 +137,7 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Plan> +
<Node-Type>Seq Scan</Node-Type> +
<Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
<Relation-Name>int8_tbl</Relation-Name> +
<Alias>i8</Alias> +
<Startup-Cost>N.N</Startup-Cost> +
@@ -183,6 +185,7 @@ select explain_filter('explain (analyze, buffers, format yaml) select * from int
- Plan: +
Node Type: "Seq Scan" +
Parallel Aware: false +
+ Async Capable: false +
Relation Name: "int8_tbl"+
Alias: "i8" +
Startup Cost: N.N +
@@ -233,6 +236,7 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -348,6 +352,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Relation Name": "tenk1", +
"Parallel Aware": true, +
"Local Hit Blocks": 0, +
@@ -393,6 +398,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Sort Space Used": 0, +
"Local Hit Blocks": 0, +
@@ -435,6 +441,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Workers Planned": 0, +
"Local Hit Blocks": 0, +
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 68ca321163..a417b566d9 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -558,6 +558,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 55, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
@@ -760,6 +761,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 70, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index ff157ceb1c..499245068a 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -204,6 +204,7 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "insertconflicttest", +
"Alias": "insertconflicttest", +
"Conflict Resolution": "UPDATE", +
@@ -213,7 +214,8 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
{ +
"Node Type": "Result", +
"Parent Relationship": "Member", +
- "Parallel Aware": false +
+ "Parallel Aware": false, +
+ "Async Capable": false +
} +
] +
} +
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 6d048e309c..98dde452e6 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -95,6 +95,7 @@ select count(*) = 0 as ok from pg_stat_wal_receiver;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
@@ -113,7 +114,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
On Tue, Nov 17, 2020 at 6:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
* I haven't yet added some planner/resowner changes from Horiguchi-san's patch.
The patch in [1] allocates, populates, and frees a wait event set every
time ExecAppendAsyncEventWait() is called, so it wouldn't leak wait
event sets. So perhaps we don't actually need the ResourceOwner change?
I thought the change to cost_append() proposed in his patch would be a
good idea, but I noticed this:
+ /*
+ * It's not obvious how to determine the total cost of
+ * async subnodes. Although it is not always true, we
+ * assume it is the maximum cost among all async subnodes.
+ */
+ if (async_max_cost < subpath->total_cost)
+ async_max_cost = subpath->total_cost;
As commented, the assumption isn't always correct (a counter-example
is the case where all async subnodes use the same connection, as
shown in [2]). Rather than modifying that function as proposed, I'm
inclined to leave it as-is.
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CAPmGK14wcXKqGDpYRieA1ETgyj+Ep5ntrGVD=29iESoQYUx9YQ@mail.gmail.com
[2]: /messages/by-id/CAPmGK17Ap6AGTFrtn3==PsVfHUkuiRPFXZqXSQ=XWQDtDbNNBQ@mail.gmail.com
On Mon, Mar 8, 2021 at 2:05 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
There seem to be no objections, so I went ahead and added the
table/server option ‘async_capable’, which defaults to false.
Attached is an updated patch.
Attached is an updated version of the patch. Changes are:
* I modified nodeAppend.c a bit further to make the code simpler
(mostly, ExecAppendAsyncBegin() and related code).
* I added a function ExecAsyncRequestPending() to execAsync.c for the
convenience of FDWs.
* I fixed a bug in the definition of WAIT_EVENT_APPEND_READY in pgstat.h.
* I fixed a bug in process_pending_request() in postgres_fdw.c.
* I added comments to executor/README based on Robert’s original patch.
* I added/adjusted/fixed some other comments and docs.
* I think it would be better to keep the existing test cases in
postgres_fdw.sql as-is for testing the existing features, so I
modified the patch to leave them unchanged, and added new test cases
for this feature.
* I rebased the patch against HEAD.
I haven’t yet added docs on FDW APIs. I think the patch would need a
bit more comments. But other than that, I feel the patch is in good
shape.
Best regards,
Etsuro Fujita
Attachments:
async-2021-03-19.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ee0b4acf0b..54ab8edfab 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -62,6 +62,7 @@ typedef struct ConnCacheEntry
Oid serverid; /* foreign server OID used to get server name */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ PgFdwConnState state; /* extra per-connection state */
} ConnCacheEntry;
/*
@@ -115,9 +116,12 @@ static bool disconnect_cached_connections(Oid serverid);
* will_prep_stmt must be true if caller intends to create any prepared
* statements. Since those don't go away automatically at transaction end
* (not even on error), we need this flag to cue manual cleanup.
+ *
+ * If state is not NULL, *state receives the per-connection state associated
+ * with the PGconn.
*/
PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
{
bool found;
bool retry = false;
@@ -196,6 +200,9 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
*/
PG_TRY();
{
+ /* Process a pending asynchronous request if any. */
+ if (entry->state.pendingAreq)
+ process_pending_request(entry->state.pendingAreq);
/* Start a new transaction or subtransaction if needed. */
begin_remote_xact(entry);
}
@@ -264,6 +271,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Remember if caller will prepare statements */
entry->have_prep_stmt |= will_prep_stmt;
+ /* If caller needs access to the per-connection state, return it. */
+ if (state)
+ *state = &entry->state;
+
return entry->conn;
}
@@ -291,6 +302,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ memset(&entry->state, 0, sizeof(entry->state));
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
@@ -648,8 +660,12 @@ GetPrepStmtNumber(PGconn *conn)
* Caller is responsible for the error handling on the result.
*/
PGresult *
-pgfdw_exec_query(PGconn *conn, const char *query)
+pgfdw_exec_query(PGconn *conn, const char *query, PgFdwConnState *state)
{
+ /* First, process a pending asynchronous request, if any. */
+ if (state && state->pendingAreq)
+ process_pending_request(state->pendingAreq);
+
/*
* Submit a query. Since we don't use non-blocking mode, this also can
* block. But its risk is relatively small, so we ignore that for now.
@@ -940,6 +956,8 @@ pgfdw_xact_callback(XactEvent event, void *arg)
{
entry->have_prep_stmt = false;
entry->have_error = false;
+ /* Also reset per-connection state */
+ memset(&entry->state, 0, sizeof(entry->state));
}
/* Disarm changing_xact_state if it all worked. */
@@ -1172,6 +1190,10 @@ pgfdw_reject_incomplete_xact_state_change(ConnCacheEntry *entry)
* Cancel the currently-in-progress query (whose query text we do not have)
* and ignore the result. Returns true if we successfully cancel the query
* and discard any pending result, and false if not.
+ *
+ * XXX: if the query was one sent by fetch_more_data_begin(), we could get the
+ * query text from the pendingAreq saved in the per-connection state, then
+ * report the query using it.
*/
static bool
pgfdw_cancel_query(PGconn *conn)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0649b6b81c..58a5c3093f 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -8946,7 +8946,7 @@ DO $d$
END;
$d$;
ERROR: invalid option "password"
-HINT: Valid options in this context are: service, passfile, channel_binding, connect_timeout, dbname, host, hostaddr, port, options, application_name, keepalives, keepalives_idle, keepalives_interval, keepalives_count, tcp_user_timeout, sslmode, sslcompression, sslcert, sslkey, sslrootcert, sslcrl, sslcrldir, requirepeer, ssl_min_protocol_version, ssl_max_protocol_version, gssencmode, krbsrvname, gsslib, target_session_attrs, use_remote_estimate, fdw_startup_cost, fdw_tuple_cost, extensions, updatable, fetch_size, batch_size
+HINT: Valid options in this context are: service, passfile, channel_binding, connect_timeout, dbname, host, hostaddr, port, options, application_name, keepalives, keepalives_idle, keepalives_interval, keepalives_count, tcp_user_timeout, sslmode, sslcompression, sslcert, sslkey, sslrootcert, sslcrl, sslcrldir, requirepeer, ssl_min_protocol_version, ssl_max_protocol_version, gssencmode, krbsrvname, gsslib, target_session_attrs, use_remote_estimate, fdw_startup_cost, fdw_tuple_cost, extensions, updatable, fetch_size, batch_size, async_capable
CONTEXT: SQL statement "ALTER SERVER loopback_nopw OPTIONS (ADD password 'dummypw')"
PL/pgSQL function inline_code_block line 3 at EXECUTE
-- If we add a password for our user mapping instead, we should get a different
@@ -9437,3 +9437,375 @@ SELECT tableoid::regclass, * FROM batch_cp_upd_test;
-- Clean up
DROP TABLE batch_table, batch_cp_upd_test CASCADE;
+-- ===================================================================
+-- test asynchronous execution
+-- ===================================================================
+ALTER SERVER loopback OPTIONS (DROP extensions);
+ALTER SERVER loopback OPTIONS (ADD async_capable 'true');
+ALTER SERVER loopback2 OPTIONS (ADD async_capable 'true');
+CREATE TABLE async_pt (a int, b int, c text) PARTITION BY RANGE (a);
+CREATE TABLE base_tbl1 (a int, b int, c text);
+CREATE TABLE base_tbl2 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p1 PARTITION OF async_pt FOR VALUES FROM (1000) TO (2000)
+ SERVER loopback OPTIONS (table_name 'base_tbl1');
+CREATE FOREIGN TABLE async_p2 PARTITION OF async_pt FOR VALUES FROM (2000) TO (3000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl2');
+INSERT INTO async_p1 SELECT 1000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+INSERT INTO async_p2 SELECT 2000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+CREATE TABLE result_tbl (a int, b int, c text);
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+(10 rows)
+
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+(2 rows)
+
+DELETE FROM result_tbl;
+-- Check case where multiple partitions use the same connection
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl3');
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Async Foreign Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (async_pt_3.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl3
+(14 rows)
+
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+ 3505 | 505 | 0505
+(3 rows)
+
+DELETE FROM result_tbl;
+DROP FOREIGN TABLE async_p3;
+DROP TABLE base_tbl3;
+-- Check case where the partitioned table has local/remote partitions
+CREATE TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000);
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (async_pt_3.b === 505)
+(13 rows)
+
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+ 3505 | 505 | 0505
+(3 rows)
+
+DELETE FROM result_tbl;
+-- Test interaction of async execution with plan-time partition pruning
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 3000;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < 3000))
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((a < 3000))
+(7 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 2000;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Foreign Scan on public.async_p1 async_pt
+ Output: async_pt.a, async_pt.b, async_pt.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < 2000))
+(3 rows)
+
+-- Test interaction of async execution with run-time partition pruning
+SET plan_cache_mode TO force_generic_plan;
+PREPARE async_pt_query (int, int) AS
+ INSERT INTO result_tbl SELECT * FROM async_pt WHERE a < $1 AND b === $2;
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (3000, 505);
+ QUERY PLAN
+------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ Subplans Removed: 1
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === $2)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < $1::integer))
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === $2)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((a < $1::integer))
+(11 rows)
+
+EXECUTE async_pt_query (3000, 505);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+(2 rows)
+
+DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (2000, 505);
+ QUERY PLAN
+------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ Subplans Removed: 2
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === $2)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < $1::integer))
+(7 rows)
+
+EXECUTE async_pt_query (2000, 505);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+(1 row)
+
+DELETE FROM result_tbl;
+RESET plan_cache_mode;
+CREATE TABLE local_tbl(a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo'), (2505, 505, 'bar');
+ANALYZE local_tbl;
+CREATE INDEX base_tbl1_idx ON base_tbl1 (a);
+CREATE INDEX base_tbl2_idx ON base_tbl2 (a);
+CREATE INDEX async_p3_idx ON async_p3 (a);
+ANALYZE base_tbl1;
+ANALYZE base_tbl2;
+ANALYZE async_p3;
+ALTER FOREIGN TABLE async_p1 OPTIONS (use_remote_estimate 'true');
+ALTER FOREIGN TABLE async_p2 OPTIONS (use_remote_estimate 'true');
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+ QUERY PLAN
+------------------------------------------------------------------------------------------
+ Nested Loop
+ Output: local_tbl.a, local_tbl.b, local_tbl.c, async_pt.a, async_pt.b, async_pt.c
+ -> Seq Scan on public.local_tbl
+ Output: local_tbl.a, local_tbl.b, local_tbl.c
+ Filter: (local_tbl.c = 'bar'::text)
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (($1::integer = a))
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (($1::integer = a))
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (local_tbl.a = async_pt_3.a)
+(15 rows)
+
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Nested Loop (actual rows=1 loops=1)
+ -> Seq Scan on local_tbl (actual rows=1 loops=1)
+ Filter: (c = 'bar'::text)
+ Rows Removed by Filter: 1
+ -> Append (actual rows=1 loops=1)
+ -> Async Foreign Scan on async_p1 async_pt_1 (never executed)
+ -> Async Foreign Scan on async_p2 async_pt_2 (actual rows=1 loops=1)
+ -> Seq Scan on async_p3 async_pt_3 (never executed)
+ Filter: (local_tbl.a = a)
+(9 rows)
+
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+ a | b | c | a | b | c
+------+-----+-----+------+-----+------
+ 2505 | 505 | bar | 2505 | 505 | 0505
+(1 row)
+
+ALTER FOREIGN TABLE async_p1 OPTIONS (DROP use_remote_estimate);
+ALTER FOREIGN TABLE async_p2 OPTIONS (DROP use_remote_estimate);
+DROP TABLE local_tbl;
+DROP INDEX base_tbl1_idx;
+DROP INDEX base_tbl2_idx;
+DROP INDEX async_p3_idx;
+-- Test that pending requests are processed properly
+SET enable_mergejoin TO false;
+SET enable_hashjoin TO false;
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ Output: t1.a, t1.b, t1.c, t2.a, t2.b, t2.c
+ Join Filter: (t1.a = t2.a)
+ -> Append
+ -> Async Foreign Scan on public.async_p1 t1_1
+ Output: t1_1.a, t1_1.b, t1_1.c
+ Filter: (t1_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 t1_2
+ Output: t1_2.a, t1_2.b, t1_2.c
+ Filter: (t1_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 t1_3
+ Output: t1_3.a, t1_3.b, t1_3.c
+ Filter: (t1_3.b === 505)
+ -> Materialize
+ Output: t2.a, t2.b, t2.c
+ -> Foreign Scan on public.async_p2 t2
+ Output: t2.a, t2.b, t2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+(20 rows)
+
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+ a | b | c | a | b | c
+------+-----+------+------+-----+------
+ 2505 | 505 | 0505 | 2505 | 505 | 0505
+(1 row)
+
+-- Check with foreign modify
+CREATE TABLE local_tbl (a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo');
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE remote_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl3');
+INSERT INTO remote_tbl VALUES (2505, 505, 'bar');
+CREATE TABLE base_tbl4 (a int, b int, c text);
+CREATE FOREIGN TABLE insert_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl4');
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Insert on public.insert_tbl
+ Remote SQL: INSERT INTO public.base_tbl4(a, b, c) VALUES ($1, $2, $3)
+ Batch Size: 1
+ -> Append
+ -> Seq Scan on public.local_tbl
+ Output: local_tbl.a, local_tbl.b, local_tbl.c
+ -> Async Foreign Scan on public.remote_tbl
+ Output: remote_tbl.a, remote_tbl.b, remote_tbl.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl3
+(9 rows)
+
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+SELECT * FROM insert_tbl ORDER BY a;
+ a | b | c
+------+-----+-----
+ 1505 | 505 | foo
+ 2505 | 505 | bar
+(2 rows)
+
+-- Check with direct modify
+CREATE TABLE join_tbl (a1 int, b1 int, c1 text, a2 int, b2 int, c2 text);
+EXPLAIN (VERBOSE, COSTS OFF)
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+ QUERY PLAN
+----------------------------------------------------------------------------------------
+ Insert on public.join_tbl
+ CTE t
+ -> Update on public.remote_tbl
+ Output: remote_tbl.a, remote_tbl.b, remote_tbl.c
+ -> Foreign Update on public.remote_tbl
+ Remote SQL: UPDATE public.base_tbl3 SET c = (c || c) RETURNING a, b, c
+ -> Nested Loop Left Join
+ Output: async_pt.a, async_pt.b, async_pt.c, t.a, t.b, t.c
+ Join Filter: ((async_pt.a = t.a) AND (async_pt.b = t.b))
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (async_pt_3.b === 505)
+ -> CTE Scan on t
+ Output: t.a, t.b, t.c
+(23 rows)
+
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+SELECT * FROM join_tbl ORDER BY a1;
+ a1 | b1 | c1 | a2 | b2 | c2
+------+-----+------+------+-----+--------
+ 1505 | 505 | 0505 | | |
+ 2505 | 505 | 0505 | 2505 | 505 | barbar
+ 3505 | 505 | 0505 | | |
+(3 rows)
+
+RESET enable_mergejoin;
+RESET enable_hashjoin;
+-- Clean up
+DROP TABLE async_pt;
+DROP TABLE base_tbl1;
+DROP TABLE base_tbl2;
+DROP TABLE result_tbl;
+DROP TABLE local_tbl;
+DROP FOREIGN TABLE remote_tbl;
+DROP FOREIGN TABLE insert_tbl;
+DROP TABLE base_tbl3;
+DROP TABLE base_tbl4;
+DROP TABLE join_tbl;
+ALTER SERVER loopback OPTIONS (DROP async_capable);
+ALTER SERVER loopback2 OPTIONS (DROP async_capable);
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 64698c4da3..530d7a66d4 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -107,7 +107,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
* Validate option value, when we can do so without any context.
*/
if (strcmp(def->defname, "use_remote_estimate") == 0 ||
- strcmp(def->defname, "updatable") == 0)
+ strcmp(def->defname, "updatable") == 0 ||
+ strcmp(def->defname, "async_capable") == 0)
{
/* these accept only boolean values */
(void) defGetBoolean(def);
@@ -217,6 +218,9 @@ InitPgFdwOptions(void)
/* batch_size is available on both server and table */
{"batch_size", ForeignServerRelationId, false},
{"batch_size", ForeignTableRelationId, false},
+ /* async_capable is available on both server and table */
+ {"async_capable", ForeignServerRelationId, false},
+ {"async_capable", ForeignTableRelationId, false},
{"password_required", UserMappingRelationId, false},
/*
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 35b48575c5..25b9085232 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -37,6 +38,7 @@
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
#include "postgres_fdw.h"
+#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/float.h"
#include "utils/guc.h"
@@ -143,6 +145,7 @@ typedef struct PgFdwScanState
/* for remote query execution */
PGconn *conn; /* connection for the scan */
+ PgFdwConnState *conn_state; /* extra per-connection state */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -159,6 +162,9 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ /* for asynchronous execution */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -176,6 +182,7 @@ typedef struct PgFdwModifyState
/* for remote query execution */
PGconn *conn; /* connection for the scan */
+ PgFdwConnState *conn_state; /* extra per-connection state */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -219,6 +226,7 @@ typedef struct PgFdwDirectModifyState
/* for remote query execution */
PGconn *conn; /* connection for the update */
+ PgFdwConnState *conn_state; /* extra per-connection state */
int numParams; /* number of parameters passed to query */
FmgrInfo *param_flinfo; /* output conversion functions for them */
List *param_exprs; /* executable expressions for param values */
@@ -408,6 +416,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(AsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(AsyncRequest *areq);
+static void postgresForeignAsyncNotify(AsyncRequest *areq);
/*
* Helper functions
@@ -437,7 +449,8 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
void *arg);
static void create_cursor(ForeignScanState *node);
static void fetch_more_data(ForeignScanState *node);
-static void close_cursor(PGconn *conn, unsigned int cursor_number);
+static void close_cursor(PGconn *conn, unsigned int cursor_number,
+ PgFdwConnState *conn_state);
static PgFdwModifyState *create_foreign_modify(EState *estate,
RangeTblEntry *rte,
ResultRelInfo *resultRelInfo,
@@ -491,6 +504,8 @@ static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
double *totaldeadrows);
static void analyze_row_processor(PGresult *res, int row,
PgFdwAnalyzeState *astate);
+static void produce_tuple_asynchronously(AsyncRequest *areq, bool fetch);
+static void fetch_more_data_begin(AsyncRequest *areq);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -583,6 +598,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for asynchronous execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -625,6 +646,7 @@ postgresGetForeignRelSize(PlannerInfo *root,
fpinfo->fdw_tuple_cost = DEFAULT_FDW_TUPLE_COST;
fpinfo->shippable_extensions = NIL;
fpinfo->fetch_size = 100;
+ fpinfo->async_capable = false;
apply_server_options(fpinfo);
apply_table_options(fpinfo);
@@ -1458,7 +1480,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
/* Assign a unique ID for my cursor */
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1509,6 +1531,9 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
+
+ /* Set the async-capable flag */
+ fsstate->async_capable = node->ss.ps.plan->async_capable;
}
/*
@@ -1523,8 +1548,10 @@ postgresIterateForeignScan(ForeignScanState *node)
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
+ * In sync mode, if this is the first call after Begin or ReScan, we need
+ * to create the cursor on the remote side. In async mode, we would have
+ * already created the cursor before we get here, even if this is the first
+ * call after Begin or ReScan.
*/
if (!fsstate->cursor_exists)
create_cursor(node);
@@ -1534,6 +1561,9 @@ postgresIterateForeignScan(ForeignScanState *node)
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
+ /* In async mode, just clear tuple slot. */
+ if (fsstate->async_capable)
+ return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
fetch_more_data(node);
@@ -1595,7 +1625,7 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->conn, sql, fsstate->conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
PQclear(res);
@@ -1623,7 +1653,8 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->conn, fsstate->cursor_number,
+ fsstate->conn_state);
/* Release remote connection */
ReleaseConnection(fsstate->conn);
@@ -2500,7 +2531,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->conn = GetConnection(user, false, &dmstate->conn_state);
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2881,7 +2912,7 @@ estimate_path_cost_size(PlannerInfo *root,
false, &retrieved_attrs, NULL);
/* Get the remote estimate */
- conn = GetConnection(fpinfo->user, false);
+ conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3327,7 +3358,7 @@ get_remote_estimate(const char *sql, PGconn *conn,
/*
* Execute EXPLAIN remotely.
*/
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_exec_query(conn, sql, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql);
@@ -3451,6 +3482,10 @@ create_cursor(ForeignScanState *node)
StringInfoData buf;
PGresult *res;
+ /* First, process a pending asynchronous request, if any. */
+ if (fsstate->conn_state->pendingAreq)
+ process_pending_request(fsstate->conn_state->pendingAreq);
+
/*
* Construct array of query parameter values in text format. We do the
* conversions in the short-lived per-tuple context, so as not to cause a
@@ -3531,17 +3566,38 @@ fetch_more_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->conn;
- char sql[64];
int numrows;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
+ if (fsstate->async_capable)
+ {
+ Assert(fsstate->conn_state->pendingAreq);
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
- pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ /*
+ * The query was already sent by an earlier call to
+ * fetch_more_data_begin. So now we just fetch the result.
+ */
+ res = pgfdw_get_result(conn, fsstate->query);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+
+ /* Reset per-connection state */
+ fsstate->conn_state->pendingAreq = NULL;
+ }
+ else
+ {
+ char sql[64];
+
+ /* This is a regular synchronous fetch. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ res = pgfdw_exec_query(conn, sql, fsstate->conn_state);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
/* Convert the data into HeapTuples */
numrows = PQntuples(res);
@@ -3633,7 +3689,8 @@ reset_transmission_modes(int nestlevel)
* Utility routine to close a cursor.
*/
static void
-close_cursor(PGconn *conn, unsigned int cursor_number)
+close_cursor(PGconn *conn, unsigned int cursor_number,
+ PgFdwConnState *conn_state)
{
char sql[64];
PGresult *res;
@@ -3644,7 +3701,7 @@ close_cursor(PGconn *conn, unsigned int cursor_number)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_exec_query(conn, sql, conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, true, sql);
PQclear(res);
@@ -3693,7 +3750,7 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->conn = GetConnection(user, true, &fmstate->conn_state);
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -3792,6 +3849,10 @@ execute_foreign_modify(EState *estate,
operation == CMD_UPDATE ||
operation == CMD_DELETE);
+ /* First, process a pending asynchronous request, if any. */
+ if (fmstate->conn_state->pendingAreq)
+ process_pending_request(fmstate->conn_state->pendingAreq);
+
/*
* If the existing query was deparsed and prepared for a different number
* of rows, rebuild it for the proper number.
@@ -3893,6 +3954,11 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
char *p_name;
PGresult *res;
+ /*
+ * The caller would already have processed a pending asynchronous request
+ * if any, so no need to do it here.
+ */
+
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
GetPrepStmtNumber(fmstate->conn));
@@ -4078,7 +4144,7 @@ deallocate_query(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->conn, sql, fmstate->conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
PQclear(res);
@@ -4226,6 +4292,10 @@ execute_dml_stmt(ForeignScanState *node)
int numParams = dmstate->numParams;
const char **values = dmstate->param_values;
+ /* First, process a pending asynchronous request, if any. */
+ if (dmstate->conn_state->pendingAreq)
+ process_pending_request(dmstate->conn_state->pendingAreq);
+
/*
* Construct array of query parameter values in text format.
*/
@@ -4627,7 +4697,7 @@ postgresAnalyzeForeignTable(Relation relation,
*/
table = GetForeignTable(RelationGetRelid(relation));
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct command to get page count for relation.
@@ -4638,7 +4708,7 @@ postgresAnalyzeForeignTable(Relation relation,
/* In what follows, do not risk leaking any PGresults. */
PG_TRY();
{
- res = pgfdw_exec_query(conn, sql.data);
+ res = pgfdw_exec_query(conn, sql.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -4713,7 +4783,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
table = GetForeignTable(RelationGetRelid(relation));
server = GetForeignServer(table->serverid);
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct cursor that retrieves whole rows from remote.
@@ -4730,7 +4800,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
int fetch_size;
ListCell *lc;
- res = pgfdw_exec_query(conn, sql.data);
+ res = pgfdw_exec_query(conn, sql.data, NULL);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
PQclear(res);
@@ -4782,7 +4852,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
*/
/* Fetch some rows */
- res = pgfdw_exec_query(conn, fetch_sql);
+ res = pgfdw_exec_query(conn, fetch_sql, NULL);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -4801,7 +4871,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
}
/* Close the cursor, just to be tidy. */
- close_cursor(conn, cursor_number);
+ close_cursor(conn, cursor_number, NULL);
}
PG_CATCH();
{
@@ -4941,7 +5011,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
*/
server = GetForeignServer(serverOid);
mapping = GetUserMapping(GetUserId(), server->serverid);
- conn = GetConnection(mapping, false);
+ conn = GetConnection(mapping, false, NULL);
/* Don't attempt to import collation if remote server hasn't got it */
if (PQserverVersion(conn) < 90100)
@@ -4957,7 +5027,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, "SELECT 1 FROM pg_catalog.pg_namespace WHERE nspname = ");
deparseStringLiteral(&buf, stmt->remote_schema);
- res = pgfdw_exec_query(conn, buf.data);
+ res = pgfdw_exec_query(conn, buf.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -5069,7 +5139,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, " ORDER BY c.relname, a.attnum");
/* Fetch the data */
- res = pgfdw_exec_query(conn, buf.data);
+ res = pgfdw_exec_query(conn, buf.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -5529,6 +5599,8 @@ apply_server_options(PgFdwRelationInfo *fpinfo)
ExtractExtensionList(defGetString(def), false);
else if (strcmp(def->defname, "fetch_size") == 0)
fpinfo->fetch_size = strtol(defGetString(def), NULL, 10);
+ else if (strcmp(def->defname, "async_capable") == 0)
+ fpinfo->async_capable = defGetBoolean(def);
}
}
@@ -5550,6 +5622,8 @@ apply_table_options(PgFdwRelationInfo *fpinfo)
fpinfo->use_remote_estimate = defGetBoolean(def);
else if (strcmp(def->defname, "fetch_size") == 0)
fpinfo->fetch_size = strtol(defGetString(def), NULL, 10);
+ else if (strcmp(def->defname, "async_capable") == 0)
+ fpinfo->async_capable = defGetBoolean(def);
}
}
@@ -5584,6 +5658,7 @@ merge_fdw_options(PgFdwRelationInfo *fpinfo,
fpinfo->shippable_extensions = fpinfo_o->shippable_extensions;
fpinfo->use_remote_estimate = fpinfo_o->use_remote_estimate;
fpinfo->fetch_size = fpinfo_o->fetch_size;
+ fpinfo->async_capable = fpinfo_o->async_capable;
/* Merge the table level options from either side of the join. */
if (fpinfo_i)
@@ -5605,6 +5680,13 @@ merge_fdw_options(PgFdwRelationInfo *fpinfo,
* relation sizes.
*/
fpinfo->fetch_size = Max(fpinfo_o->fetch_size, fpinfo_i->fetch_size);
+
+ /*
+ * We'll prefer to consider this join async-capable if any table from
+ * either side of the join is considered async-capable.
+ */
+ fpinfo->async_capable = fpinfo_o->async_capable ||
+ fpinfo_i->async_capable;
}
}
@@ -6488,6 +6570,218 @@ add_foreign_final_paths(PlannerInfo *root, RelOptInfo *input_rel,
add_path(final_rel, (Path *) final_path);
}
+/*
+ * postgresIsForeignPathAsyncCapable
+ * Check whether a given ForeignPath node is async-capable.
+ */
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ RelOptInfo *rel = ((Path *) path)->parent;
+ PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) rel->fdw_private;
+
+ return fpinfo->async_capable;
+}
+
+/*
+ * postgresForeignAsyncRequest
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncRequest(AsyncRequest *areq)
+{
+ produce_tuple_asynchronously(areq, true);
+}
+
+/*
+ * postgresForeignAsyncConfigureWait
+ * Configure a file descriptor event for which we wish to wait.
+ */
+static void
+postgresForeignAsyncConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AsyncRequest *pendingAreq = fsstate->conn_state->pendingAreq;
+ AppendState *requestor = (AppendState *) areq->requestor;
+ WaitEventSet *set = requestor->as_eventset;
+
+ /* This should not be called unless callback_pending */
+ Assert(areq->callback_pending);
+
+ /* The core code would have registered postmaster death event */
+ Assert(GetNumRegisteredWaitEvents(set) >= 1);
+
+ /* Begin an asynchronous data fetch if necessary */
+ if (!pendingAreq)
+ fetch_more_data_begin(areq);
+ else if (pendingAreq->requestor != areq->requestor)
+ {
+ if (GetNumRegisteredWaitEvents(set) > 1)
+ return;
+ process_pending_request(pendingAreq);
+ fetch_more_data_begin(areq);
+ }
+ else if (pendingAreq->requestee != areq->requestee)
+ return;
+ else
+ Assert(pendingAreq == areq);
+
+ AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
+ NULL, areq);
+}
+
+/*
+ * postgresForeignAsyncNotify
+ * Fetch some more tuples from a file descriptor that becomes ready,
+ * requesting next tuple.
+ */
+static void
+postgresForeignAsyncNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+
+ /* The core code would have initialized the callback_pending flag */
+ Assert(!areq->callback_pending);
+
+ /* The request should be currently in-process */
+ Assert(fsstate->conn_state->pendingAreq == areq);
+
+ /* On error, report the original query, not the FETCH. */
+ if (!PQconsumeInput(fsstate->conn))
+ pgfdw_report_error(ERROR, NULL, fsstate->conn, false, fsstate->query);
+
+ fetch_more_data(node);
+
+ produce_tuple_asynchronously(areq, true);
+}
+
+/*
+ * Asynchronously produce next tuple from a foreign PostgreSQL table.
+ */
+static void
+produce_tuple_asynchronously(AsyncRequest *areq, bool fetch)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AsyncRequest *pendingAreq = fsstate->conn_state->pendingAreq;
+ TupleTableSlot *result;
+
+ /* This should not be called if the request is currently in-process */
+ Assert(areq != pendingAreq);
+
+ /* Request some more tuples, if we've run out */
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ /* No point in another fetch if we already detected EOF, though */
+ if (!fsstate->eof_reached)
+ {
+ /* Mark the request as pending for a callback */
+ ExecAsyncRequestPending(areq);
+ /* Begin another fetch if requested and if no pending request */
+ if (fetch && !pendingAreq)
+ fetch_more_data_begin(areq);
+ }
+ else
+ {
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ }
+ return;
+ }
+
+ /* Get a tuple from the ForeignScan node */
+ result = ExecProcNode((PlanState *) node);
+ if (!TupIsNull(result))
+ {
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ return;
+ }
+ Assert(fsstate->next_tuple >= fsstate->num_tuples);
+
+ /* Request some more tuples, if we've not detected EOF yet */
+ if (!fsstate->eof_reached)
+ {
+ /* Mark the request as pending for a callback */
+ ExecAsyncRequestPending(areq);
+ /* Begin another fetch if requested and if no pending request */
+ if (fetch && !pendingAreq)
+ fetch_more_data_begin(areq);
+ }
+ else
+ {
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ }
+}
+
+/*
+ * Begin an asynchronous data fetch.
+ *
+ * Note: fetch_more_data must be called to fetch the result.
+ */
+static void
+fetch_more_data_begin(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ char sql[64];
+
+ Assert(!fsstate->conn_state->pendingAreq);
+
+ /* Create the cursor synchronously. */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ /* We will send this query, but not wait for the response. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(fsstate->conn, sql))
+ pgfdw_report_error(ERROR, NULL, fsstate->conn, false, fsstate->query);
+
+ /* Remember that the request is in process */
+ fsstate->conn_state->pendingAreq = areq;
+}
+
+/*
+ * Process a pending asynchronous request.
+ */
+void
+process_pending_request(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ EState *estate = node->ss.ps.state;
+ MemoryContext oldcontext;
+
+ /* The request should be currently in-process */
+ Assert(fsstate->conn_state->pendingAreq == areq);
+ /* and would have been pending for a callback */
+ Assert(areq->callback_pending);
+
+ oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+
+ /* Unlike AsyncNotify, we unset callback_pending ourselves */
+ areq->callback_pending = false;
+
+ fetch_more_data(node);
+
+ /* We need to send a new query afterwards; don't fetch */
+ produce_tuple_asynchronously(areq, false);
+
+ /* Unlike AsyncNotify, we call ExecAsyncResponse ourselves */
+ ExecAsyncResponse(areq);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 1f67b4d9fd..88d94da6f6 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,6 +16,7 @@
#include "foreign/foreign.h"
#include "lib/stringinfo.h"
#include "libpq-fe.h"
+#include "nodes/execnodes.h"
#include "nodes/pathnodes.h"
#include "utils/relcache.h"
@@ -78,6 +79,7 @@ typedef struct PgFdwRelationInfo
Cost fdw_startup_cost;
Cost fdw_tuple_cost;
List *shippable_extensions; /* OIDs of shippable extensions */
+ bool async_capable;
/* Cached catalog information. */
ForeignTable *table;
@@ -124,17 +126,28 @@ typedef struct PgFdwRelationInfo
int relation_index;
} PgFdwRelationInfo;
+/*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+ AsyncRequest *pendingAreq; /* pending async request */
+} PgFdwConnState;
+
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
+extern void process_pending_request(AsyncRequest *areq);
/* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+ PgFdwConnState **state);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
extern PGresult *pgfdw_get_result(PGconn *conn, const char *query);
-extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query);
+extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query,
+ PgFdwConnState *state);
extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
bool clear, const char *sql);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 2b525ea44a..aad8077f46 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -2928,3 +2928,173 @@ SELECT tableoid::regclass, * FROM batch_cp_upd_test;
-- Clean up
DROP TABLE batch_table, batch_cp_upd_test CASCADE;
+
+-- ===================================================================
+-- test asynchronous execution
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS (DROP extensions);
+ALTER SERVER loopback OPTIONS (ADD async_capable 'true');
+ALTER SERVER loopback2 OPTIONS (ADD async_capable 'true');
+
+CREATE TABLE async_pt (a int, b int, c text) PARTITION BY RANGE (a);
+CREATE TABLE base_tbl1 (a int, b int, c text);
+CREATE TABLE base_tbl2 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p1 PARTITION OF async_pt FOR VALUES FROM (1000) TO (2000)
+ SERVER loopback OPTIONS (table_name 'base_tbl1');
+CREATE FOREIGN TABLE async_p2 PARTITION OF async_pt FOR VALUES FROM (2000) TO (3000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl2');
+INSERT INTO async_p1 SELECT 1000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+INSERT INTO async_p2 SELECT 2000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+
+CREATE TABLE result_tbl (a int, b int, c text);
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+-- Check case where multiple partitions use the same connection
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl3');
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+DROP FOREIGN TABLE async_p3;
+DROP TABLE base_tbl3;
+
+-- Check case where the partitioned table has local/remote partitions
+CREATE TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000);
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+-- Test interaction of async execution with plan-time partition pruning
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 3000;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 2000;
+
+-- Test interaction of async execution with run-time partition pruning
+SET plan_cache_mode TO force_generic_plan;
+
+PREPARE async_pt_query (int, int) AS
+ INSERT INTO result_tbl SELECT * FROM async_pt WHERE a < $1 AND b === $2;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (3000, 505);
+EXECUTE async_pt_query (3000, 505);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (2000, 505);
+EXECUTE async_pt_query (2000, 505);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+RESET plan_cache_mode;
+
+CREATE TABLE local_tbl(a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo'), (2505, 505, 'bar');
+ANALYZE local_tbl;
+
+CREATE INDEX base_tbl1_idx ON base_tbl1 (a);
+CREATE INDEX base_tbl2_idx ON base_tbl2 (a);
+CREATE INDEX async_p3_idx ON async_p3 (a);
+ANALYZE base_tbl1;
+ANALYZE base_tbl2;
+ANALYZE async_p3;
+
+ALTER FOREIGN TABLE async_p1 OPTIONS (use_remote_estimate 'true');
+ALTER FOREIGN TABLE async_p2 OPTIONS (use_remote_estimate 'true');
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+
+ALTER FOREIGN TABLE async_p1 OPTIONS (DROP use_remote_estimate);
+ALTER FOREIGN TABLE async_p2 OPTIONS (DROP use_remote_estimate);
+
+DROP TABLE local_tbl;
+DROP INDEX base_tbl1_idx;
+DROP INDEX base_tbl2_idx;
+DROP INDEX async_p3_idx;
+
+-- Test that pending requests are processed properly
+SET enable_mergejoin TO false;
+SET enable_hashjoin TO false;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+
+-- Check with foreign modify
+CREATE TABLE local_tbl (a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo');
+
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE remote_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl3');
+INSERT INTO remote_tbl VALUES (2505, 505, 'bar');
+
+CREATE TABLE base_tbl4 (a int, b int, c text);
+CREATE FOREIGN TABLE insert_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl4');
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+
+SELECT * FROM insert_tbl ORDER BY a;
+
+-- Check with direct modify
+CREATE TABLE join_tbl (a1 int, b1 int, c1 text, a2 int, b2 int, c2 text);
+
+EXPLAIN (VERBOSE, COSTS OFF)
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+
+SELECT * FROM join_tbl ORDER BY a1;
+
+RESET enable_mergejoin;
+RESET enable_hashjoin;
+
+-- Clean up
+DROP TABLE async_pt;
+DROP TABLE base_tbl1;
+DROP TABLE base_tbl2;
+DROP TABLE result_tbl;
+DROP TABLE local_tbl;
+DROP FOREIGN TABLE remote_tbl;
+DROP FOREIGN TABLE insert_tbl;
+DROP TABLE base_tbl3;
+DROP TABLE base_tbl4;
+DROP TABLE join_tbl;
+
+ALTER SERVER loopback OPTIONS (DROP async_capable);
+ALTER SERVER loopback2 OPTIONS (DROP async_capable);
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 863ac31c6b..17fd782ab7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4781,6 +4781,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<variablelist>
+ <varlistentry id="guc-enable-async-append" xreflabel="enable_async_append">
+ <term><varname>enable_async_append</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_async_append</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of async-aware
+ append plan types. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
<term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index db4b4e460c..b2ea336a43 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1569,6 +1569,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AppendReady</literal></entry>
+ <entry>Waiting for subplan nodes of an <literal>Append</literal> plan
+ node to be ready.</entry>
+ </row>
<row>
<entry><literal>BackupWaitWalArchive</literal></entry>
<entry>Waiting for WAL files required for a backup to be successfully
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 07aa25799d..00600bf171 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -371,6 +371,34 @@ OPTIONS (ADD password_required 'false');
</sect3>
+ <sect3>
+ <title>Asynchronous Execution Options</title>
+
+ <para>
+ <filename>postgres_fdw</filename> supports asynchronous execution, which
+ improves query performance by running multiple parts of
+ an <structname>Append</structname> node concurrently rather than serially.
+ This execution can be controlled using the following option:
+ </para>
+
+ <variablelist>
+
+ <varlistentry>
+ <term><literal>async_capable</literal></term>
+ <listitem>
+ <para>
+ This option controls whether <filename>postgres_fdw</filename> allows
+ foreign tables to be scanned concurrently for asynchronous execution.
+ It can be specified for a foreign table or a foreign server.
+ A table-level option overrides a server-level option.
+ The default is <literal>false</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect3>
+
<sect3>
<title>Updatability Options</title>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index afc45429ba..fe75cabdcc 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1394,6 +1394,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1413,6 +1415,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (custom_name)
ExplainPropertyText("Custom Plan Provider", custom_name, es);
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
+ ExplainPropertyBool("Async Capable", plan->async_capable, es);
}
switch (nodeTag(plan))
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 74ac59faa1..680fd69151 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index 18b2ac1865..4b0ffe2818 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -359,3 +359,43 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+
+Asynchronous Execution
+----------------------
+
+In cases where a node is waiting on an event external to the database system,
+such as a ForeignScan awaiting network I/O, it's desirable for the node to
+indicate that it cannot return any tuple immediately but may be able to do so
+at a later time. A process which discovers this type of situation can always
+handle it simply by blocking, but this may waste time that could be spent
+executing some other part of the plan tree where progress could be made
+immediately. This is particularly likely to occur when the plan tree contains
+an Append node. Asynchronous execution improves query performance by running
+multiple parts of an Append node concurrently rather than serially.
+
+For asynchronous execution, an Append node must first request a tuple from an
+async-capable child node using ExecAsyncRequest. Next, it must execute the
+asynchronous event loop using ExecAppendAsyncEventWait. Eventually, when a
+child node to which an asynchronous request has been made produces a tuple,
+the Append node will receive it from the event loop via ExecAsyncResponse. In
+the current implementation of asynchronous execution, the only node type that
+requests tuples from an async-capable child node is an Append, while the only
+node type that might be async-capable is a ForeignScan.
+
+Typically, the ExecAsyncResponse callback is the only one required for nodes
+that wish to request tuples asynchronously. On the other hand, async-capable
+nodes generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+ will be invoked; it should use ExecAsyncRequestPending to indicate that the
+ request is pending for a callback described below. Alternatively, it can
+ instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events, the
+ node's ExecAsyncConfigureWait callback will be invoked to configure the
+ file descriptor event for which the node wishes to wait.
+
+3. When the file descriptor becomes ready, the node's ExecAsyncNotify callback
+ will be invoked; like #1, it should use ExecAsyncRequestPending for another
+ callback or ExecAsyncRequestDone to return a result immediately.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 4543ac79ed..58a8aa5ab7 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -531,6 +531,10 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans > 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e69de29bb2..4c09065a7f 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,124 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ */
+void
+ExecAsyncRequest(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor event
+ * for which it wishes to wait.  We expect the node-type specific callback to
+ * make a single call of the following form:
+ *
+ * AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, areq);
+ */
+void
+ExecAsyncConfigureWait(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+void
+ExecAsyncNotify(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+void
+ExecAsyncResponse(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * A requestee node should call this function to deliver the tuple to its
+ * requestor node. The requestee node can call this from its ExecAsyncRequest
+ * or ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result)
+{
+ areq->request_complete = true;
+ areq->result = result;
+}
+
+/*
+ * A requestee node should call this function to indicate that it is pending
+ * for a callback. The requestee node can call this from its ExecAsyncRequest
+ * or ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestPending(AsyncRequest *areq)
+{
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ areq->result = NULL;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 15e4115bd6..dd73d5d9cd 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -57,10 +57,13 @@
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
/* Shared state for parallel-aware Append. */
struct ParallelAppendState
@@ -78,12 +81,18 @@ struct ParallelAppendState
};
#define INVALID_SUBPLAN_INDEX -1
+#define EVENT_BUFFER_SIZE 16
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
static void mark_invalid_subplans_as_finished(AppendState *node);
+static void ExecAppendAsyncBegin(AppendState *node);
+static bool ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result);
+static void ExecAppendAsyncEventWait(AppendState *node);
+static void classify_matching_subplans(AppendState *node);
/* ----------------------------------------------------------------
* ExecInitAppend
@@ -102,7 +111,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
AppendState *appendstate = makeNode(AppendState);
PlanState **appendplanstates;
Bitmapset *validsubplans;
+ Bitmapset *asyncplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
@@ -119,6 +130,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_syncdone = false;
+ appendstate->as_begun = false;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -191,12 +204,24 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* While at it, find out the first valid partial plan.
*/
j = 0;
+ asyncplans = NULL;
+ nasyncplans = 0;
firstvalid = nplans;
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record async subplans. When executing EvalPlanQual, we process
+ * async subplans synchronously, so don't do this in that case.
+ */
+ if (initNode->async_capable && estate->es_epq_active == NULL)
+ {
+ asyncplans = bms_add_member(asyncplans, j);
+ nasyncplans++;
+ }
+
/*
* Record the lowest appendplans index which is a valid partial plan.
*/
@@ -210,6 +235,37 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* Initialize async state */
+ appendstate->as_asyncplans = asyncplans;
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_asyncrequests = NULL;
+ appendstate->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ appendstate->as_needrequest = NULL;
+ appendstate->as_eventset = NULL;
+
+ if (nasyncplans > 0)
+ {
+ appendstate->as_asyncrequests = (AsyncRequest **)
+ palloc0(nplans * sizeof(AsyncRequest *));
+
+ i = -1;
+ while ((i = bms_next_member(asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq;
+
+ areq = palloc(sizeof(AsyncRequest));
+ areq->requestor = (PlanState *) appendstate;
+ areq->requestee = appendplanstates[i];
+ areq->request_index = i;
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+
+ appendstate->as_asyncrequests[i] = areq;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -232,31 +288,59 @@ static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
+ TupleTableSlot *result;
- if (node->as_whichplan < 0)
+ /*
+ * If this is the first call after Init or ReScan, we need to do the
+ * remaining initialization work.
+ */
+ if (!node->as_begun)
{
+ Assert(node->as_whichplan == INVALID_SUBPLAN_INDEX);
+ Assert(!node->as_syncdone);
+
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ /* If there are any async subplans, begin execution of them. */
+ if (node->as_nasyncplans > 0)
+ ExecAppendAsyncBegin(node);
+
/*
- * If no subplan has been chosen, we must choose one before
+ * If no sync subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
- !node->choose_next_subplan(node))
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+
+ Assert(node->as_syncdone ||
+ (node->as_whichplan >= 0 &&
+ node->as_whichplan < node->as_nplans));
+
+ /* And we're initialized. */
+ node->as_begun = true;
}
for (;;)
{
PlanState *subnode;
- TupleTableSlot *result;
CHECK_FOR_INTERRUPTS();
/*
- * figure out which subplan we are currently processing
+ * try to get a tuple from an async subplan if any
+ */
+ if (node->as_syncdone || !bms_is_empty(node->as_needrequest))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ }
+
+ /*
+ * figure out which sync subplan we are currently processing
*/
Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
subnode = node->appendplans[node->as_whichplan];
@@ -276,8 +360,16 @@ ExecAppend(PlanState *pstate)
return result;
}
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /*
+ * wait or poll async events if any. We do this before checking for
+ * the end of iteration, because it might drain the remaining async
+ * subplans.
+ */
+ if (node->as_nasyncremain > 0)
+ ExecAppendAsyncEventWait(node);
+
+ /* choose new sync subplan; if no sync/async subplans, we're done */
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -313,6 +405,7 @@ ExecEndAppend(AppendState *node)
void
ExecReScanAppend(AppendState *node)
{
+ int nasyncplans = node->as_nasyncplans;
int i;
/*
@@ -326,6 +419,11 @@ ExecReScanAppend(AppendState *node)
{
bms_free(node->as_valid_subplans);
node->as_valid_subplans = NULL;
+ if (nasyncplans > 0)
+ {
+ bms_free(node->as_valid_asyncplans);
+ node->as_valid_asyncplans = NULL;
+ }
}
for (i = 0; i < node->as_nplans; i++)
@@ -347,8 +445,27 @@ ExecReScanAppend(AppendState *node)
ExecReScan(subnode);
}
+ /* Reset async state */
+ if (nasyncplans > 0)
+ {
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+ }
+
+ bms_free(node->as_needrequest);
+ node->as_needrequest = NULL;
+ }
+
/* Let choose_next_subplan_* function handle setting the first subplan */
node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_syncdone = false;
+ node->as_begun = false;
}
/* ----------------------------------------------------------------
@@ -429,7 +546,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
/* ----------------------------------------------------------------
* choose_next_subplan_locally
*
- * Choose next subplan for a non-parallel-aware Append,
+ * Choose next sync subplan for a non-parallel-aware Append,
* returning false if there are no more.
* ----------------------------------------------------------------
*/
@@ -442,16 +559,25 @@ choose_next_subplan_locally(AppendState *node)
/* We should never be called when there are no subplans */
Assert(node->as_nplans > 0);
+ /* Nothing to do if syncdone */
+ if (node->as_syncdone)
+ return false;
+
/*
* If first call then have the bms member function choose the first valid
- * subplan by initializing whichplan to -1. If there happen to be no
- * valid subplans then the bms member function will handle that by
- * returning a negative number which will allow us to exit returning a
+ * sync subplan by initializing whichplan to -1. If there happen to be
+ * no valid sync subplans then the bms member function will handle that
+ * by returning a negative number which will allow us to exit returning a
* false value.
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
{
- if (node->as_valid_subplans == NULL)
+ if (node->as_nasyncplans > 0)
+ {
+ /* We'd have filled as_valid_subplans already */
+ Assert(node->as_valid_subplans);
+ }
+ else if (node->as_valid_subplans == NULL)
node->as_valid_subplans =
ExecFindMatchingSubPlans(node->as_prune_state);
@@ -467,7 +593,12 @@ choose_next_subplan_locally(AppendState *node)
nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
if (nextplan < 0)
+ {
+ /* Set as_syncdone if in async mode */
+ if (node->as_nasyncplans > 0)
+ node->as_syncdone = true;
return false;
+ }
node->as_whichplan = nextplan;
@@ -709,3 +840,302 @@ mark_invalid_subplans_as_finished(AppendState *node)
node->as_pstate->pa_finished[i] = true;
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncBegin
+ *
+ * Begin execution of designed async-capable subplans.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncBegin(AppendState *node)
+{
+ int i;
+
+ /* We should never be called when there are no async subplans. */
+ Assert(node->as_nasyncplans > 0);
+
+ /* If we've yet to determine the valid subplans then do so now. */
+ if (node->as_valid_subplans == NULL)
+ node->as_valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ classify_matching_subplans(node);
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (node->as_nasyncremain == 0)
+ return;
+
+ /* Make a request for each of the async subplans. */
+ i = -1;
+ while ((i = bms_next_member(node->as_valid_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(areq->request_index == i);
+ Assert(!areq->callback_pending);
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncGetNext
+ *
+ * Get the next tuple from any of the asynchronous subplans.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
+{
+ *result = NULL;
+
+ /* We should never be called when there are no valid async subplans */
+ Assert(node->as_nasyncremain > 0);
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ while (node->as_nasyncremain > 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Wait or poll async events */
+ ExecAppendAsyncEventWait(node);
+
+ /* Make new async requests. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ /* Break from loop if there is any sync node that is not complete */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If all sync subplans are complete, we're totally done scanning the
+ * given node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the sync subplans.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncremain == 0);
+ *result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncRequest
+ *
+ * If there are any asynchronous subplans that need a new asynchronous
+ * request, make all of them.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
+{
+ Bitmapset *needrequest;
+ int i;
+
+ /* Nothing to do if there are no async subplans needing a new request. */
+ if (bms_is_empty(node->as_needrequest))
+ return false;
+
+ /*
+ * If there are any asynchronously-generated results that have not yet
+ * been returned, we have nothing to do; just return one of them.
+ */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ /* Make a new request for each of the async subplans that need it. */
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ i = -1;
+ while ((i = bms_next_member(needrequest, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(needrequest);
+
+ /* Return one of the asynchronously-generated results if any. */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncEventWait
+ *
+ * Wait or poll for file descriptor wait events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncEventWait(AppendState *node)
+{
+ long timeout = node->as_syncdone ? -1 : 0;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+
+ /* We should never be called when there are no valid async subplans. */
+ Assert(node->as_nasyncremain > 0);
+
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
+ node->as_nasyncplans + 1);
+ AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+ NULL, NULL);
+
+ /* Give each waiting subplan a chance to add an event. */
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ if (areq->callback_pending)
+ ExecAsyncConfigureWait(areq);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+ EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ FreeWaitEventSet(node->as_eventset);
+ node->as_eventset = NULL;
+ if (noccurred == 0)
+ return;
+
+ /* Deliver notifications. */
+ for (i = 0; i < noccurred; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ /*
+ * Each waiting subplan should have registered its wait event with
+ * user_data pointing back to its AsyncRequest.
+ */
+ if ((w->events & WL_SOCKET_READABLE) != 0)
+ {
+ AsyncRequest *areq = (AsyncRequest *) w->user_data;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ Assert(areq->callback_pending);
+ areq->callback_pending = false;
+
+ /* Do the actual work. */
+ ExecAsyncNotify(areq);
+ }
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(AsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot = areq->result;
+
+ /* The result should be a TupleTableSlot or NULL. */
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Nothing to do if the request is pending. */
+ if (!areq->request_complete)
+ {
+ /*
+ * The subplan for which the request was made would be pending for a
+ * callback.
+ */
+ Assert(areq->callback_pending);
+ return;
+ }
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ {
+ /* The ending subplan wouldn't have been pending for a callback. */
+ Assert(!areq->callback_pending);
+ --node->as_nasyncremain;
+ return;
+ }
+
+ /* Save result so we can return it */
+ Assert(node->as_nasyncresults < node->as_nasyncplans);
+ node->as_asyncresults[node->as_nasyncresults++] = slot;
+
+ /*
+ * Mark the subplan that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
+
+/* ----------------------------------------------------------------
+ * classify_matching_subplans
+ *
+ * Classify the node's as_valid_subplans into sync ones and
+ * async ones, adjust it to contain sync ones only, and save
+ * async ones in the node's as_valid_asyncplans
+ * ----------------------------------------------------------------
+ */
+static void
+classify_matching_subplans(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+
+ Assert(node->as_valid_asyncplans == NULL);
+
+ /* Nothing to do if there are no valid subplans. */
+ if (bms_is_empty(node->as_valid_subplans))
+ {
+ node->as_syncdone = true;
+ node->as_nasyncremain = 0;
+ return;
+ }
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (!bms_overlap(node->as_valid_subplans, node->as_asyncplans))
+ {
+ node->as_nasyncremain = 0;
+ return;
+ }
+
+ /* Get valid async subplans. */
+ valid_asyncplans = bms_copy(node->as_asyncplans);
+ valid_asyncplans = bms_int_members(valid_asyncplans,
+ node->as_valid_subplans);
+
+ /* Adjust the valid subplans to contain sync subplans only. */
+ node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
+ valid_asyncplans);
+ node->as_syncdone = bms_is_empty(node->as_valid_subplans);
+
+ /* Save valid async subplans. */
+ node->as_valid_asyncplans = valid_asyncplans;
+ node->as_nasyncremain = bms_num_members(valid_asyncplans);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0969e53c3a..898890fb08 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -391,3 +391,51 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Asynchronously request a tuple from a designated async-capable node
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Callback invoked when a relevant event has occurred
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index bda379ba91..673a353b48 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -242,6 +243,7 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5054490c58..52810b0ba3 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -432,6 +433,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9b8f81c523..42ac0a871f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1616,6 +1616,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1712,6 +1713,7 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c81e2cf244..1902bcbb68 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -149,6 +149,7 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
+bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 906cab7053..06774a9ec3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,6 +81,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
+static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1080,6 +1081,30 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
+
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1097,6 +1122,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1104,6 +1130,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1167,6 +1194,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
+ /* If appropriate, consider async append */
+ consider_async = (enable_async_append && pathkeys == NIL &&
+ !best_path->path.parallel_safe &&
+ list_length(best_path->subpaths) > 1);
+
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1234,6 +1266,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (consider_async && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ ++nasyncplans;
+ }
}
/*
@@ -1266,6 +1305,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
+ plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 208a33692f..999c0fbf3f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3998,6 +3998,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
switch (w)
{
+ case WAIT_EVENT_APPEND_READY:
+ event_name = "AppendReady";
+ break;
case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
event_name = "BackupWaitWalArchive";
break;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 43a5fded10..5f3318fa8f 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -2020,6 +2020,15 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
}
#endif
+/*
+ * Get the number of wait events registered in a given WaitEventSet.
+ */
+int
+GetNumRegisteredWaitEvents(WaitEventSet *set)
+{
+ return set->nevents;
+}
+
#if defined(WAIT_USE_POLL)
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 997b4b70ee..6413b286fc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1111,6 +1111,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of async append plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_async_append,
+ true,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3ff507d5f6..4938af4c26 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -372,6 +372,7 @@
#enable_parallel_hash = on
#enable_partition_pruning = on
#enable_parallel_insert = on
+#enable_async_append = on
# - Planner Cost Constants -
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
index e69de29bb2..724034f226 100644
--- a/src/include/executor/execAsync.h
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(AsyncRequest *areq);
+extern void ExecAsyncConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncNotify(AsyncRequest *areq);
+extern void ExecAsyncResponse(AsyncRequest *areq);
+extern void ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result);
+extern void ExecAsyncRequestPending(AsyncRequest *areq);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index cafd410a5d..fa54ac6ad2 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -25,4 +25,6 @@ extern void ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt);
+extern void ExecAsyncAppendResponse(AsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 6ae7733e25..8ffc0ca5bf 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,4 +31,8 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(AsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncForeignScanNotify(AsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 248f78da45..7c89d081c7 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -178,6 +178,14 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+
+typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncConfigureWait_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncNotify_function) (AsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -256,6 +264,12 @@ typedef struct FdwRoutine
/* Support functions for path reparameterization. */
ReparameterizeForeignPathByChild_function ReparameterizeForeignPathByChild;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e31ad6204e..e36842b467 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -515,6 +515,22 @@ typedef struct ResultRelInfo
struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
+/* ----------------
+ * AsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct AsyncRequest
+{
+ struct PlanState *requestor; /* Node that wants a tuple */
+ struct PlanState *requestee; /* Node from which a tuple is wanted */
+ int request_index; /* Scratch space for requestor */
+ bool callback_pending; /* Callback is needed */
+ bool request_complete; /* Request complete, result valid */
+ TupleTableSlot *result; /* Result (NULL if no more tuples) */
+} AsyncRequest;
+
/* ----------------
* EState information
*
@@ -1220,12 +1236,25 @@ struct AppendState
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
int as_whichplan;
+ bool as_begun; /* false means need to initialize */
+ Bitmapset *as_asyncplans; /* asynchronous plans indexes */
+ int as_nasyncplans; /* # of asynchronous plans */
+ AsyncRequest **as_asyncrequests; /* array of AsyncRequests */
+ TupleTableSlot **as_asyncresults; /* unreturned results of async plans */
+ int as_nasyncresults; /* # of valid entries in as_asyncresults */
+ bool as_syncdone; /* true if all synchronous plans done in
+ * asynchronous mode, else false */
+ int as_nasyncremain; /* # of remaining async plans */
+ Bitmapset *as_needrequest; /* async plans ready for a request */
+ struct WaitEventSet *as_eventset; /* WaitEventSet used to configure
+ * file descriptor wait events */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_asyncplans;
bool (*choose_next_subplan) (AppendState *);
};
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 95292d7573..af9c5f4e77 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -131,6 +131,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -247,6 +252,7 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
+ int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 22e6db96b6..9fc147c953 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -66,6 +66,7 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
+extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be43c04802..c9b1214a04 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -967,7 +967,8 @@ typedef enum
*/
typedef enum
{
- WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE = PG_WAIT_IPC,
+ WAIT_EVENT_APPEND_READY = PG_WAIT_IPC,
+ WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE,
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
WAIT_EVENT_BTREE_PAGE,
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 9e94fcaec2..44f9368c64 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -179,5 +179,6 @@ extern int WaitLatch(Latch *latch, int wakeEvents, long timeout,
extern int WaitLatchOrSocket(Latch *latch, int wakeEvents,
pgsocket sock, long timeout, uint32 wait_event_info);
extern void InitializeLatchWaitSet(void);
+extern int GetNumRegisteredWaitEvents(WaitEventSet *set);
#endif /* LATCH_H */
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 791eba8511..b89b99fb02 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -87,6 +87,7 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -136,6 +137,7 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Plan> +
<Node-Type>Seq Scan</Node-Type> +
<Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
<Relation-Name>int8_tbl</Relation-Name> +
<Alias>i8</Alias> +
<Startup-Cost>N.N</Startup-Cost> +
@@ -183,6 +185,7 @@ select explain_filter('explain (analyze, buffers, format yaml) select * from int
- Plan: +
Node Type: "Seq Scan" +
Parallel Aware: false +
+ Async Capable: false +
Relation Name: "int8_tbl"+
Alias: "i8" +
Startup Cost: N.N +
@@ -233,6 +236,7 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -346,6 +350,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Relation Name": "tenk1", +
"Parallel Aware": true, +
"Local Hit Blocks": 0, +
@@ -391,6 +396,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Sort Space Used": 0, +
"Local Hit Blocks": 0, +
@@ -433,6 +439,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Workers Planned": 0, +
"Local Hit Blocks": 0, +
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 68ca321163..a417b566d9 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -558,6 +558,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 55, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
@@ -760,6 +761,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 70, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index ff157ceb1c..499245068a 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -204,6 +204,7 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "insertconflicttest", +
"Alias": "insertconflicttest", +
"Conflict Resolution": "UPDATE", +
@@ -213,7 +214,8 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
{ +
"Node Type": "Result", +
"Parent Relationship": "Member", +
- "Parallel Aware": false +
+ "Parallel Aware": false, +
+ "Async Capable": false +
} +
] +
} +
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a62bf5dc92..94d143c5c3 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -95,6 +95,7 @@ select count(*) = 0 as ok from pg_stat_wal_receiver;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
@@ -114,7 +115,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(19 rows)
+(20 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
is_async_capable_path() should probably have a "break" for case T_ForeignPath.
little typos:
aready
sigle
givne
a event: an event
--
Justin
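For readers following the review comment above: the issue with a missing "break" in a switch over path types is that control silently falls through into the next case's logic. The snippet below is a minimal standalone illustration of the fixed shape, not the actual is_async_capable_path() from the patch — the enum tags, the fdw_says_yes flag, and the function name are all stand-ins for this sketch.

```c
#include <stdbool.h>

/* Stand-in node tags; the real function switches on the path's nodeTag.
 * These names are illustrative only. */
typedef enum
{
	T_ForeignPath,
	T_SomeOtherPath
} NodeTag;

/*
 * Hedged sketch of a capability check: every case must end in "break"
 * (or return), otherwise control falls through into the following case.
 */
static bool
is_async_capable(NodeTag tag, bool fdw_says_yes)
{
	switch (tag)
	{
		case T_ForeignPath:
			if (fdw_says_yes)
				return true;
			break;				/* the fix: without this, control would
								 * fall through to the next case */
		default:
			break;
	}
	return false;
}
```

With the break in place, a T_ForeignPath whose FDW declines async execution correctly reports false instead of running whatever code the next case happens to contain.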
On Fri, Mar 19, 2021 at 9:57 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
is_async_capable_path() should probably have a "break" for case T_ForeignPath.
Good catch! Will fix.
little typos:
aready
sigle
givne
a event
Lots of typos. :-( Will fix.
Thank you for the review!
Best regards,
Etsuro Fujita
On Fri, Mar 19, 2021 at 8:48 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I haven’t yet added docs on FDW APIs. I think the patch would need a
bit more comments.
Here is an updated patch. Changes are:
* Added docs on FDW APIs.
* Added/tweaked some more comments.
* Fixed a bug and typos pointed out by Justin.
* Added an assertion to ExecAppendAsyncBegin().
* Added a bit more regression test cases.
* Rebased the patch against HEAD.
I think the patch is now committable.
Best regards,
Etsuro Fujita
Attachments:
0001-async-2021-03-29.patch (application/octet-stream)
From 8849639f2d1f6f425dac702e37c1e5cdd19fd507 Mon Sep 17 00:00:00 2001
From: Etsuro Fujita <efujita@postgresql.org>
Date: Mon, 29 Mar 2021 18:45:11 +0900
Subject: [PATCH] async 2021-03-29.
---
contrib/postgres_fdw/connection.c | 26 +-
.../postgres_fdw/expected/postgres_fdw.out | 509 +++++++++++++++++-
contrib/postgres_fdw/option.c | 6 +-
contrib/postgres_fdw/postgres_fdw.c | 374 +++++++++++--
contrib/postgres_fdw/postgres_fdw.h | 17 +-
contrib/postgres_fdw/sql/postgres_fdw.sql | 195 +++++++
doc/src/sgml/config.sgml | 14 +
doc/src/sgml/fdwhandler.sgml | 88 +++
doc/src/sgml/monitoring.sgml | 5 +
doc/src/sgml/postgres-fdw.sgml | 28 +
src/backend/commands/explain.c | 3 +
src/backend/executor/Makefile | 1 +
src/backend/executor/README | 40 ++
src/backend/executor/execAmi.c | 4 +
src/backend/executor/execAsync.c | 124 +++++
src/backend/executor/nodeAppend.c | 465 +++++++++++++++-
src/backend/executor/nodeForeignscan.c | 48 ++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 41 ++
src/backend/postmaster/pgstat.c | 3 +
src/backend/storage/ipc/latch.c | 9 +
src/backend/utils/misc/guc.c | 10 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/executor/execAsync.h | 25 +
src/include/executor/nodeAppend.h | 2 +
src/include/executor/nodeForeignscan.h | 4 +
src/include/foreign/fdwapi.h | 14 +
src/include/nodes/execnodes.h | 37 +-
src/include/nodes/plannodes.h | 6 +
src/include/optimizer/cost.h | 1 +
src/include/pgstat.h | 3 +-
src/include/storage/latch.h | 1 +
src/test/regress/expected/explain.out | 7 +
.../regress/expected/incremental_sort.out | 2 +
src/test/regress/expected/insert_conflict.out | 4 +-
src/test/regress/expected/sysviews.out | 3 +-
39 files changed, 2070 insertions(+), 57 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ee0b4acf0b..54ab8edfab 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -62,6 +62,7 @@ typedef struct ConnCacheEntry
Oid serverid; /* foreign server OID used to get server name */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ PgFdwConnState state; /* extra per-connection state */
} ConnCacheEntry;
/*
@@ -115,9 +116,12 @@ static bool disconnect_cached_connections(Oid serverid);
* will_prep_stmt must be true if caller intends to create any prepared
* statements. Since those don't go away automatically at transaction end
* (not even on error), we need this flag to cue manual cleanup.
+ *
+ * If state is not NULL, *state receives the per-connection state associated
+ * with the PGconn.
*/
PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
{
bool found;
bool retry = false;
@@ -196,6 +200,9 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
*/
PG_TRY();
{
+ /* Process a pending asynchronous request if any. */
+ if (entry->state.pendingAreq)
+ process_pending_request(entry->state.pendingAreq);
/* Start a new transaction or subtransaction if needed. */
begin_remote_xact(entry);
}
@@ -264,6 +271,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Remember if caller will prepare statements */
entry->have_prep_stmt |= will_prep_stmt;
+ /* If caller needs access to the per-connection state, return it. */
+ if (state)
+ *state = &entry->state;
+
return entry->conn;
}
@@ -291,6 +302,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ memset(&entry->state, 0, sizeof(entry->state));
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
@@ -648,8 +660,12 @@ GetPrepStmtNumber(PGconn *conn)
* Caller is responsible for the error handling on the result.
*/
PGresult *
-pgfdw_exec_query(PGconn *conn, const char *query)
+pgfdw_exec_query(PGconn *conn, const char *query, PgFdwConnState *state)
{
+ /* First, process a pending asynchronous request, if any. */
+ if (state && state->pendingAreq)
+ process_pending_request(state->pendingAreq);
+
/*
* Submit a query. Since we don't use non-blocking mode, this also can
* block. But its risk is relatively small, so we ignore that for now.
@@ -940,6 +956,8 @@ pgfdw_xact_callback(XactEvent event, void *arg)
{
entry->have_prep_stmt = false;
entry->have_error = false;
+ /* Also reset per-connection state */
+ memset(&entry->state, 0, sizeof(entry->state));
}
/* Disarm changing_xact_state if it all worked. */
@@ -1172,6 +1190,10 @@ pgfdw_reject_incomplete_xact_state_change(ConnCacheEntry *entry)
* Cancel the currently-in-progress query (whose query text we do not have)
* and ignore the result. Returns true if we successfully cancel the query
* and discard any pending result, and false if not.
+ *
+ * XXX: if the query was one sent by fetch_more_data_begin(), we could get the
+ * query text from the pendingAreq saved in the per-connection state, then
+ * report the query using it.
*/
static bool
pgfdw_cancel_query(PGconn *conn)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0649b6b81c..a285412623 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -8946,7 +8946,7 @@ DO $d$
END;
$d$;
ERROR: invalid option "password"
-HINT: Valid options in this context are: service, passfile, channel_binding, connect_timeout, dbname, host, hostaddr, port, options, application_name, keepalives, keepalives_idle, keepalives_interval, keepalives_count, tcp_user_timeout, sslmode, sslcompression, sslcert, sslkey, sslrootcert, sslcrl, sslcrldir, requirepeer, ssl_min_protocol_version, ssl_max_protocol_version, gssencmode, krbsrvname, gsslib, target_session_attrs, use_remote_estimate, fdw_startup_cost, fdw_tuple_cost, extensions, updatable, fetch_size, batch_size
+HINT: Valid options in this context are: service, passfile, channel_binding, connect_timeout, dbname, host, hostaddr, port, options, application_name, keepalives, keepalives_idle, keepalives_interval, keepalives_count, tcp_user_timeout, sslmode, sslcompression, sslcert, sslkey, sslrootcert, sslcrl, sslcrldir, requirepeer, ssl_min_protocol_version, ssl_max_protocol_version, gssencmode, krbsrvname, gsslib, target_session_attrs, use_remote_estimate, fdw_startup_cost, fdw_tuple_cost, extensions, updatable, fetch_size, batch_size, async_capable
CONTEXT: SQL statement "ALTER SERVER loopback_nopw OPTIONS (ADD password 'dummypw')"
PL/pgSQL function inline_code_block line 3 at EXECUTE
-- If we add a password for our user mapping instead, we should get a different
@@ -9437,3 +9437,510 @@ SELECT tableoid::regclass, * FROM batch_cp_upd_test;
-- Clean up
DROP TABLE batch_table, batch_cp_upd_test CASCADE;
+-- ===================================================================
+-- test asynchronous execution
+-- ===================================================================
+ALTER SERVER loopback OPTIONS (DROP extensions);
+ALTER SERVER loopback OPTIONS (ADD async_capable 'true');
+ALTER SERVER loopback2 OPTIONS (ADD async_capable 'true');
+CREATE TABLE async_pt (a int, b int, c text) PARTITION BY RANGE (a);
+CREATE TABLE base_tbl1 (a int, b int, c text);
+CREATE TABLE base_tbl2 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p1 PARTITION OF async_pt FOR VALUES FROM (1000) TO (2000)
+ SERVER loopback OPTIONS (table_name 'base_tbl1');
+CREATE FOREIGN TABLE async_p2 PARTITION OF async_pt FOR VALUES FROM (2000) TO (3000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl2');
+INSERT INTO async_p1 SELECT 1000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+INSERT INTO async_p2 SELECT 2000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+-- simple queries
+CREATE TABLE result_tbl (a int, b int, c text);
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b % 100 = 0;
+ QUERY PLAN
+----------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 100) = 0))
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (((b % 100) = 0))
+(8 rows)
+
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b % 100 = 0;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1000 | 0 | 0000
+ 1100 | 100 | 0100
+ 1200 | 200 | 0200
+ 1300 | 300 | 0300
+ 1400 | 400 | 0400
+ 1500 | 500 | 0500
+ 1600 | 600 | 0600
+ 1700 | 700 | 0700
+ 1800 | 800 | 0800
+ 1900 | 900 | 0900
+ 2000 | 0 | 0000
+ 2100 | 100 | 0100
+ 2200 | 200 | 0200
+ 2300 | 300 | 0300
+ 2400 | 400 | 0400
+ 2500 | 500 | 0500
+ 2600 | 600 | 0600
+ 2700 | 700 | 0700
+ 2800 | 800 | 0800
+ 2900 | 900 | 0900
+(20 rows)
+
+DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+(10 rows)
+
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+(2 rows)
+
+DELETE FROM result_tbl;
+-- Check case where multiple partitions use the same connection
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl3');
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Async Foreign Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (async_pt_3.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl3
+(14 rows)
+
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+ 3505 | 505 | 0505
+(3 rows)
+
+DELETE FROM result_tbl;
+DROP FOREIGN TABLE async_p3;
+DROP TABLE base_tbl3;
+-- Check case where the partitioned table has local/remote partitions
+CREATE TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000);
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (async_pt_3.b === 505)
+(13 rows)
+
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+ 3505 | 505 | 0505
+(3 rows)
+
+DELETE FROM result_tbl;
+-- partitionwise joins
+SET enable_partitionwise_join TO true;
+CREATE TABLE join_tbl (a1 int, b1 int, c1 text, a2 int, b2 int, c2 text);
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO join_tbl SELECT * FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Insert on public.join_tbl
+ -> Append
+ -> Async Foreign Scan
+ Output: t1_1.a, t1_1.b, t1_1.c, t2_1.a, t2_1.b, t2_1.c
+ Relations: (public.async_p1 t1_1) INNER JOIN (public.async_p1 t2_1)
+ Remote SQL: SELECT r5.a, r5.b, r5.c, r8.a, r8.b, r8.c FROM (public.base_tbl1 r5 INNER JOIN public.base_tbl1 r8 ON (((r5.a = r8.a)) AND ((r5.b = r8.b)) AND (((r5.b % 100) = 0))))
+ -> Async Foreign Scan
+ Output: t1_2.a, t1_2.b, t1_2.c, t2_2.a, t2_2.b, t2_2.c
+ Relations: (public.async_p2 t1_2) INNER JOIN (public.async_p2 t2_2)
+ Remote SQL: SELECT r6.a, r6.b, r6.c, r9.a, r9.b, r9.c FROM (public.base_tbl2 r6 INNER JOIN public.base_tbl2 r9 ON (((r6.a = r9.a)) AND ((r6.b = r9.b)) AND (((r6.b % 100) = 0))))
+ -> Hash Join
+ Output: t1_3.a, t1_3.b, t1_3.c, t2_3.a, t2_3.b, t2_3.c
+ Hash Cond: ((t2_3.a = t1_3.a) AND (t2_3.b = t1_3.b))
+ -> Seq Scan on public.async_p3 t2_3
+ Output: t2_3.a, t2_3.b, t2_3.c
+ -> Hash
+ Output: t1_3.a, t1_3.b, t1_3.c
+ -> Seq Scan on public.async_p3 t1_3
+ Output: t1_3.a, t1_3.b, t1_3.c
+ Filter: ((t1_3.b % 100) = 0)
+(20 rows)
+
+INSERT INTO join_tbl SELECT * FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+SELECT * FROM join_tbl ORDER BY a1;
+ a1 | b1 | c1 | a2 | b2 | c2
+------+-----+------+------+-----+------
+ 1000 | 0 | 0000 | 1000 | 0 | 0000
+ 1100 | 100 | 0100 | 1100 | 100 | 0100
+ 1200 | 200 | 0200 | 1200 | 200 | 0200
+ 1300 | 300 | 0300 | 1300 | 300 | 0300
+ 1400 | 400 | 0400 | 1400 | 400 | 0400
+ 1500 | 500 | 0500 | 1500 | 500 | 0500
+ 1600 | 600 | 0600 | 1600 | 600 | 0600
+ 1700 | 700 | 0700 | 1700 | 700 | 0700
+ 1800 | 800 | 0800 | 1800 | 800 | 0800
+ 1900 | 900 | 0900 | 1900 | 900 | 0900
+ 2000 | 0 | 0000 | 2000 | 0 | 0000
+ 2100 | 100 | 0100 | 2100 | 100 | 0100
+ 2200 | 200 | 0200 | 2200 | 200 | 0200
+ 2300 | 300 | 0300 | 2300 | 300 | 0300
+ 2400 | 400 | 0400 | 2400 | 400 | 0400
+ 2500 | 500 | 0500 | 2500 | 500 | 0500
+ 2600 | 600 | 0600 | 2600 | 600 | 0600
+ 2700 | 700 | 0700 | 2700 | 700 | 0700
+ 2800 | 800 | 0800 | 2800 | 800 | 0800
+ 2900 | 900 | 0900 | 2900 | 900 | 0900
+ 3000 | 0 | 0000 | 3000 | 0 | 0000
+ 3100 | 100 | 0100 | 3100 | 100 | 0100
+ 3200 | 200 | 0200 | 3200 | 200 | 0200
+ 3300 | 300 | 0300 | 3300 | 300 | 0300
+ 3400 | 400 | 0400 | 3400 | 400 | 0400
+ 3500 | 500 | 0500 | 3500 | 500 | 0500
+ 3600 | 600 | 0600 | 3600 | 600 | 0600
+ 3700 | 700 | 0700 | 3700 | 700 | 0700
+ 3800 | 800 | 0800 | 3800 | 800 | 0800
+ 3900 | 900 | 0900 | 3900 | 900 | 0900
+(30 rows)
+
+DELETE FROM join_tbl;
+RESET enable_partitionwise_join;
+-- Test interaction of async execution with plan-time partition pruning
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 3000;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < 3000))
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((a < 3000))
+(7 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 2000;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Foreign Scan on public.async_p1 async_pt
+ Output: async_pt.a, async_pt.b, async_pt.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < 2000))
+(3 rows)
+
+-- Test interaction of async execution with run-time partition pruning
+SET plan_cache_mode TO force_generic_plan;
+PREPARE async_pt_query (int, int) AS
+ INSERT INTO result_tbl SELECT * FROM async_pt WHERE a < $1 AND b === $2;
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (3000, 505);
+ QUERY PLAN
+------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ Subplans Removed: 1
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === $2)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < $1::integer))
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === $2)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((a < $1::integer))
+(11 rows)
+
+EXECUTE async_pt_query (3000, 505);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+(2 rows)
+
+DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (2000, 505);
+ QUERY PLAN
+------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ Subplans Removed: 2
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === $2)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < $1::integer))
+(7 rows)
+
+EXECUTE async_pt_query (2000, 505);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+(1 row)
+
+DELETE FROM result_tbl;
+RESET plan_cache_mode;
+CREATE TABLE local_tbl(a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo'), (2505, 505, 'bar');
+ANALYZE local_tbl;
+CREATE INDEX base_tbl1_idx ON base_tbl1 (a);
+CREATE INDEX base_tbl2_idx ON base_tbl2 (a);
+CREATE INDEX async_p3_idx ON async_p3 (a);
+ANALYZE base_tbl1;
+ANALYZE base_tbl2;
+ANALYZE async_p3;
+ALTER FOREIGN TABLE async_p1 OPTIONS (use_remote_estimate 'true');
+ALTER FOREIGN TABLE async_p2 OPTIONS (use_remote_estimate 'true');
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+ QUERY PLAN
+------------------------------------------------------------------------------------------
+ Nested Loop
+ Output: local_tbl.a, local_tbl.b, local_tbl.c, async_pt.a, async_pt.b, async_pt.c
+ -> Seq Scan on public.local_tbl
+ Output: local_tbl.a, local_tbl.b, local_tbl.c
+ Filter: (local_tbl.c = 'bar'::text)
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (($1::integer = a))
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (($1::integer = a))
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (local_tbl.a = async_pt_3.a)
+(15 rows)
+
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Nested Loop (actual rows=1 loops=1)
+ -> Seq Scan on local_tbl (actual rows=1 loops=1)
+ Filter: (c = 'bar'::text)
+ Rows Removed by Filter: 1
+ -> Append (actual rows=1 loops=1)
+ -> Async Foreign Scan on async_p1 async_pt_1 (never executed)
+ -> Async Foreign Scan on async_p2 async_pt_2 (actual rows=1 loops=1)
+ -> Seq Scan on async_p3 async_pt_3 (never executed)
+ Filter: (local_tbl.a = a)
+(9 rows)
+
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+ a | b | c | a | b | c
+------+-----+-----+------+-----+------
+ 2505 | 505 | bar | 2505 | 505 | 0505
+(1 row)
+
+ALTER FOREIGN TABLE async_p1 OPTIONS (DROP use_remote_estimate);
+ALTER FOREIGN TABLE async_p2 OPTIONS (DROP use_remote_estimate);
+DROP TABLE local_tbl;
+DROP INDEX base_tbl1_idx;
+DROP INDEX base_tbl2_idx;
+DROP INDEX async_p3_idx;
+-- Test that pending requests are processed properly
+SET enable_mergejoin TO false;
+SET enable_hashjoin TO false;
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ Output: t1.a, t1.b, t1.c, t2.a, t2.b, t2.c
+ Join Filter: (t1.a = t2.a)
+ -> Append
+ -> Async Foreign Scan on public.async_p1 t1_1
+ Output: t1_1.a, t1_1.b, t1_1.c
+ Filter: (t1_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 t1_2
+ Output: t1_2.a, t1_2.b, t1_2.c
+ Filter: (t1_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 t1_3
+ Output: t1_3.a, t1_3.b, t1_3.c
+ Filter: (t1_3.b === 505)
+ -> Materialize
+ Output: t2.a, t2.b, t2.c
+ -> Foreign Scan on public.async_p2 t2
+ Output: t2.a, t2.b, t2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+(20 rows)
+
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+ a | b | c | a | b | c
+------+-----+------+------+-----+------
+ 2505 | 505 | 0505 | 2505 | 505 | 0505
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ Output: t1.a, t1.b, t1.c
+ -> Append
+ -> Async Foreign Scan on public.async_p1 t1_1
+ Output: t1_1.a, t1_1.b, t1_1.c
+ Filter: (t1_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 t1_2
+ Output: t1_2.a, t1_2.b, t1_2.c
+ Filter: (t1_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 t1_3
+ Output: t1_3.a, t1_3.b, t1_3.c
+ Filter: (t1_3.b === 505)
+(14 rows)
+
+SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+ a | b | c
+------+-----+------
+ 3505 | 505 | 0505
+(1 row)
+
+-- Check with foreign modify
+CREATE TABLE local_tbl (a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo');
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE remote_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl3');
+INSERT INTO remote_tbl VALUES (2505, 505, 'bar');
+CREATE TABLE base_tbl4 (a int, b int, c text);
+CREATE FOREIGN TABLE insert_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl4');
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Insert on public.insert_tbl
+ Remote SQL: INSERT INTO public.base_tbl4(a, b, c) VALUES ($1, $2, $3)
+ Batch Size: 1
+ -> Append
+ -> Seq Scan on public.local_tbl
+ Output: local_tbl.a, local_tbl.b, local_tbl.c
+ -> Async Foreign Scan on public.remote_tbl
+ Output: remote_tbl.a, remote_tbl.b, remote_tbl.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl3
+(9 rows)
+
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+SELECT * FROM insert_tbl ORDER BY a;
+ a | b | c
+------+-----+-----
+ 1505 | 505 | foo
+ 2505 | 505 | bar
+(2 rows)
+
+-- Check with direct modify
+EXPLAIN (VERBOSE, COSTS OFF)
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+ QUERY PLAN
+----------------------------------------------------------------------------------------
+ Insert on public.join_tbl
+ CTE t
+ -> Update on public.remote_tbl
+ Output: remote_tbl.a, remote_tbl.b, remote_tbl.c
+ -> Foreign Update on public.remote_tbl
+ Remote SQL: UPDATE public.base_tbl3 SET c = (c || c) RETURNING a, b, c
+ -> Nested Loop Left Join
+ Output: async_pt.a, async_pt.b, async_pt.c, t.a, t.b, t.c
+ Join Filter: ((async_pt.a = t.a) AND (async_pt.b = t.b))
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (async_pt_3.b === 505)
+ -> CTE Scan on t
+ Output: t.a, t.b, t.c
+(23 rows)
+
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+SELECT * FROM join_tbl ORDER BY a1;
+ a1 | b1 | c1 | a2 | b2 | c2
+------+-----+------+------+-----+--------
+ 1505 | 505 | 0505 | | |
+ 2505 | 505 | 0505 | 2505 | 505 | barbar
+ 3505 | 505 | 0505 | | |
+(3 rows)
+
+DELETE FROM join_tbl;
+RESET enable_mergejoin;
+RESET enable_hashjoin;
+-- Clean up
+DROP TABLE async_pt;
+DROP TABLE base_tbl1;
+DROP TABLE base_tbl2;
+DROP TABLE result_tbl;
+DROP TABLE local_tbl;
+DROP FOREIGN TABLE remote_tbl;
+DROP FOREIGN TABLE insert_tbl;
+DROP TABLE base_tbl3;
+DROP TABLE base_tbl4;
+DROP TABLE join_tbl;
+ALTER SERVER loopback OPTIONS (DROP async_capable);
+ALTER SERVER loopback2 OPTIONS (DROP async_capable);
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 64698c4da3..530d7a66d4 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -107,7 +107,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
* Validate option value, when we can do so without any context.
*/
if (strcmp(def->defname, "use_remote_estimate") == 0 ||
- strcmp(def->defname, "updatable") == 0)
+ strcmp(def->defname, "updatable") == 0 ||
+ strcmp(def->defname, "async_capable") == 0)
{
/* these accept only boolean values */
(void) defGetBoolean(def);
@@ -217,6 +218,9 @@ InitPgFdwOptions(void)
/* batch_size is available on both server and table */
{"batch_size", ForeignServerRelationId, false},
{"batch_size", ForeignTableRelationId, false},
+ /* async_capable is available on both server and table */
+ {"async_capable", ForeignServerRelationId, false},
+ {"async_capable", ForeignTableRelationId, false},
{"password_required", UserMappingRelationId, false},
/*
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 35b48575c5..691b1401ad 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -37,6 +38,7 @@
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
#include "postgres_fdw.h"
+#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/float.h"
#include "utils/guc.h"
@@ -143,6 +145,7 @@ typedef struct PgFdwScanState
/* for remote query execution */
PGconn *conn; /* connection for the scan */
+ PgFdwConnState *conn_state; /* extra per-connection state */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -159,6 +162,9 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ /* for asynchronous execution */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -176,6 +182,7 @@ typedef struct PgFdwModifyState
/* for remote query execution */
PGconn *conn; /* connection for the scan */
+ PgFdwConnState *conn_state; /* extra per-connection state */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -219,6 +226,7 @@ typedef struct PgFdwDirectModifyState
/* for remote query execution */
PGconn *conn; /* connection for the update */
+ PgFdwConnState *conn_state; /* extra per-connection state */
int numParams; /* number of parameters passed to query */
FmgrInfo *param_flinfo; /* output conversion functions for them */
List *param_exprs; /* executable expressions for param values */
@@ -408,6 +416,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(AsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(AsyncRequest *areq);
+static void postgresForeignAsyncNotify(AsyncRequest *areq);
/*
* Helper functions
@@ -437,7 +449,8 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
void *arg);
static void create_cursor(ForeignScanState *node);
static void fetch_more_data(ForeignScanState *node);
-static void close_cursor(PGconn *conn, unsigned int cursor_number);
+static void close_cursor(PGconn *conn, unsigned int cursor_number,
+ PgFdwConnState *conn_state);
static PgFdwModifyState *create_foreign_modify(EState *estate,
RangeTblEntry *rte,
ResultRelInfo *resultRelInfo,
@@ -491,6 +504,8 @@ static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
double *totaldeadrows);
static void analyze_row_processor(PGresult *res, int row,
PgFdwAnalyzeState *astate);
+static void produce_tuple_asynchronously(AsyncRequest *areq, bool fetch);
+static void fetch_more_data_begin(AsyncRequest *areq);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -583,6 +598,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for asynchronous execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -617,14 +638,16 @@ postgresGetForeignRelSize(PlannerInfo *root,
fpinfo->server = GetForeignServer(fpinfo->table->serverid);
/*
- * Extract user-settable option values. Note that per-table setting of
- * use_remote_estimate overrides per-server setting.
+ * Extract user-settable option values. Note that per-table settings of
+ * use_remote_estimate, fetch_size and async_capable override per-server
+ * settings of them, respectively.
*/
fpinfo->use_remote_estimate = false;
fpinfo->fdw_startup_cost = DEFAULT_FDW_STARTUP_COST;
fpinfo->fdw_tuple_cost = DEFAULT_FDW_TUPLE_COST;
fpinfo->shippable_extensions = NIL;
fpinfo->fetch_size = 100;
+ fpinfo->async_capable = false;
apply_server_options(fpinfo);
apply_table_options(fpinfo);
@@ -1458,7 +1481,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
/* Assign a unique ID for my cursor */
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1509,6 +1532,9 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
+
+ /* Set the async-capable flag */
+ fsstate->async_capable = node->ss.ps.plan->async_capable;
}
/*
@@ -1523,8 +1549,10 @@ postgresIterateForeignScan(ForeignScanState *node)
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
+ * In sync mode, if this is the first call after Begin or ReScan, we need
+ * to create the cursor on the remote side. In async mode, we would have
+ * already created the cursor before we get here, even if this is the
+ * first call after Begin or ReScan.
*/
if (!fsstate->cursor_exists)
create_cursor(node);
@@ -1534,6 +1562,9 @@ postgresIterateForeignScan(ForeignScanState *node)
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
+ /* In async mode, just clear tuple slot. */
+ if (fsstate->async_capable)
+ return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
fetch_more_data(node);
@@ -1595,7 +1626,7 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->conn, sql, fsstate->conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
PQclear(res);
@@ -1623,7 +1654,8 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->conn, fsstate->cursor_number,
+ fsstate->conn_state);
/* Release remote connection */
ReleaseConnection(fsstate->conn);
@@ -2500,7 +2532,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->conn = GetConnection(user, false, &dmstate->conn_state);
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2881,7 +2913,7 @@ estimate_path_cost_size(PlannerInfo *root,
false, &retrieved_attrs, NULL);
/* Get the remote estimate */
- conn = GetConnection(fpinfo->user, false);
+ conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3327,7 +3359,7 @@ get_remote_estimate(const char *sql, PGconn *conn,
/*
* Execute EXPLAIN remotely.
*/
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_exec_query(conn, sql, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql);
@@ -3451,6 +3483,10 @@ create_cursor(ForeignScanState *node)
StringInfoData buf;
PGresult *res;
+ /* First, process a pending asynchronous request, if any. */
+ if (fsstate->conn_state->pendingAreq)
+ process_pending_request(fsstate->conn_state->pendingAreq);
+
/*
* Construct array of query parameter values in text format. We do the
* conversions in the short-lived per-tuple context, so as not to cause a
@@ -3531,17 +3567,38 @@ fetch_more_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->conn;
- char sql[64];
int numrows;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
+ if (fsstate->async_capable)
+ {
+ Assert(fsstate->conn_state->pendingAreq);
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
- pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ /*
+ * The query was already sent by an earlier call to
+ * fetch_more_data_begin. So now we just fetch the result.
+ */
+ res = pgfdw_get_result(conn, fsstate->query);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+
+ /* Reset per-connection state */
+ fsstate->conn_state->pendingAreq = NULL;
+ }
+ else
+ {
+ char sql[64];
+
+ /* This is a regular synchronous fetch. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ res = pgfdw_exec_query(conn, sql, fsstate->conn_state);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
/* Convert the data into HeapTuples */
numrows = PQntuples(res);
@@ -3633,7 +3690,8 @@ reset_transmission_modes(int nestlevel)
* Utility routine to close a cursor.
*/
static void
-close_cursor(PGconn *conn, unsigned int cursor_number)
+close_cursor(PGconn *conn, unsigned int cursor_number,
+ PgFdwConnState *conn_state)
{
char sql[64];
PGresult *res;
@@ -3644,7 +3702,7 @@ close_cursor(PGconn *conn, unsigned int cursor_number)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_exec_query(conn, sql, conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, true, sql);
PQclear(res);
@@ -3693,7 +3751,7 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->conn = GetConnection(user, true, &fmstate->conn_state);
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -3792,6 +3850,10 @@ execute_foreign_modify(EState *estate,
operation == CMD_UPDATE ||
operation == CMD_DELETE);
+ /* First, process a pending asynchronous request, if any. */
+ if (fmstate->conn_state->pendingAreq)
+ process_pending_request(fmstate->conn_state->pendingAreq);
+
/*
* If the existing query was deparsed and prepared for a different number
* of rows, rebuild it for the proper number.
@@ -3893,6 +3955,11 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
char *p_name;
PGresult *res;
+ /*
+ * The caller would already have processed a pending asynchronous request
+ * if any, so no need to do it here.
+ */
+
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
GetPrepStmtNumber(fmstate->conn));
@@ -4078,7 +4145,7 @@ deallocate_query(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->conn, sql, fmstate->conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
PQclear(res);
@@ -4226,6 +4293,10 @@ execute_dml_stmt(ForeignScanState *node)
int numParams = dmstate->numParams;
const char **values = dmstate->param_values;
+ /* First, process a pending asynchronous request, if any. */
+ if (dmstate->conn_state->pendingAreq)
+ process_pending_request(dmstate->conn_state->pendingAreq);
+
/*
* Construct array of query parameter values in text format.
*/
@@ -4627,7 +4698,7 @@ postgresAnalyzeForeignTable(Relation relation,
*/
table = GetForeignTable(RelationGetRelid(relation));
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct command to get page count for relation.
@@ -4638,7 +4709,7 @@ postgresAnalyzeForeignTable(Relation relation,
/* In what follows, do not risk leaking any PGresults. */
PG_TRY();
{
- res = pgfdw_exec_query(conn, sql.data);
+ res = pgfdw_exec_query(conn, sql.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -4713,7 +4784,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
table = GetForeignTable(RelationGetRelid(relation));
server = GetForeignServer(table->serverid);
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct cursor that retrieves whole rows from remote.
@@ -4730,7 +4801,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
int fetch_size;
ListCell *lc;
- res = pgfdw_exec_query(conn, sql.data);
+ res = pgfdw_exec_query(conn, sql.data, NULL);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
PQclear(res);
@@ -4782,7 +4853,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
*/
/* Fetch some rows */
- res = pgfdw_exec_query(conn, fetch_sql);
+ res = pgfdw_exec_query(conn, fetch_sql, NULL);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -4801,7 +4872,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
}
/* Close the cursor, just to be tidy. */
- close_cursor(conn, cursor_number);
+ close_cursor(conn, cursor_number, NULL);
}
PG_CATCH();
{
@@ -4941,7 +5012,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
*/
server = GetForeignServer(serverOid);
mapping = GetUserMapping(GetUserId(), server->serverid);
- conn = GetConnection(mapping, false);
+ conn = GetConnection(mapping, false, NULL);
/* Don't attempt to import collation if remote server hasn't got it */
if (PQserverVersion(conn) < 90100)
@@ -4957,7 +5028,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, "SELECT 1 FROM pg_catalog.pg_namespace WHERE nspname = ");
deparseStringLiteral(&buf, stmt->remote_schema);
- res = pgfdw_exec_query(conn, buf.data);
+ res = pgfdw_exec_query(conn, buf.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -5069,7 +5140,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, " ORDER BY c.relname, a.attnum");
/* Fetch the data */
- res = pgfdw_exec_query(conn, buf.data);
+ res = pgfdw_exec_query(conn, buf.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -5529,6 +5600,8 @@ apply_server_options(PgFdwRelationInfo *fpinfo)
ExtractExtensionList(defGetString(def), false);
else if (strcmp(def->defname, "fetch_size") == 0)
fpinfo->fetch_size = strtol(defGetString(def), NULL, 10);
+ else if (strcmp(def->defname, "async_capable") == 0)
+ fpinfo->async_capable = defGetBoolean(def);
}
}
@@ -5550,6 +5623,8 @@ apply_table_options(PgFdwRelationInfo *fpinfo)
fpinfo->use_remote_estimate = defGetBoolean(def);
else if (strcmp(def->defname, "fetch_size") == 0)
fpinfo->fetch_size = strtol(defGetString(def), NULL, 10);
+ else if (strcmp(def->defname, "async_capable") == 0)
+ fpinfo->async_capable = defGetBoolean(def);
}
}
@@ -5584,6 +5659,7 @@ merge_fdw_options(PgFdwRelationInfo *fpinfo,
fpinfo->shippable_extensions = fpinfo_o->shippable_extensions;
fpinfo->use_remote_estimate = fpinfo_o->use_remote_estimate;
fpinfo->fetch_size = fpinfo_o->fetch_size;
+ fpinfo->async_capable = fpinfo_o->async_capable;
/* Merge the table level options from either side of the join. */
if (fpinfo_i)
@@ -5605,6 +5681,13 @@ merge_fdw_options(PgFdwRelationInfo *fpinfo,
* relation sizes.
*/
fpinfo->fetch_size = Max(fpinfo_o->fetch_size, fpinfo_i->fetch_size);
+
+ /*
+ * We'll prefer to consider this join async-capable if any table from
+ * either side of the join is considered async-capable.
+ */
+ fpinfo->async_capable = fpinfo_o->async_capable ||
+ fpinfo_i->async_capable;
}
}
@@ -6488,6 +6571,235 @@ add_foreign_final_paths(PlannerInfo *root, RelOptInfo *input_rel,
add_path(final_rel, (Path *) final_path);
}
+/*
+ * postgresIsForeignPathAsyncCapable
+ * Check whether a given ForeignPath node is async-capable.
+ */
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ RelOptInfo *rel = ((Path *) path)->parent;
+ PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) rel->fdw_private;
+
+ return fpinfo->async_capable;
+}
+
+/*
+ * postgresForeignAsyncRequest
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncRequest(AsyncRequest *areq)
+{
+ produce_tuple_asynchronously(areq, true);
+}
+
+/*
+ * postgresForeignAsyncConfigureWait
+ * Configure a file descriptor event for which we wish to wait.
+ */
+static void
+postgresForeignAsyncConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AsyncRequest *pendingAreq = fsstate->conn_state->pendingAreq;
+ AppendState *requestor = (AppendState *) areq->requestor;
+ WaitEventSet *set = requestor->as_eventset;
+
+ /* This should not be called unless callback_pending */
+ Assert(areq->callback_pending);
+
+ /* The core code would have registered postmaster death event */
+ Assert(GetNumRegisteredWaitEvents(set) >= 1);
+
+ /* Begin an asynchronous data fetch if necessary */
+ if (!pendingAreq)
+ fetch_more_data_begin(areq);
+ else if (pendingAreq->requestor != areq->requestor)
+ {
+ /*
+ * This is the case when the in-process request was made by another
+ * Append. Note that it might be useless to process the request,
+ * because the query might not need tuples from that Append anymore.
+ * Skip the given request if there are any configured events other
+ * than the postmaster death event; otherwise process the request,
+ * then begin a fetch to configure the event below, because otherwise
+ * we might end up with no configured events other than the postmaster
+ * death event.
+ */
+ if (GetNumRegisteredWaitEvents(set) > 1)
+ return;
+ process_pending_request(pendingAreq);
+ fetch_more_data_begin(areq);
+ }
+ else if (pendingAreq->requestee != areq->requestee)
+ {
+ /*
+ * This is the case when the in-process request was made by the same
+ * parent but for a different child. Since we configure only the
+ * event for the request made for that child, skip the given request.
+ */
+ return;
+ }
+ else
+ Assert(pendingAreq == areq);
+
+ AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
+ NULL, areq);
+}
+
+/*
+ * postgresForeignAsyncNotify
+ *		Fetch more tuples from the file descriptor that has become ready,
+ *		then request the next tuple.
+ */
+static void
+postgresForeignAsyncNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+
+ /* The core code would have initialized the callback_pending flag */
+ Assert(!areq->callback_pending);
+
+ /* The request should be currently in-process */
+ Assert(fsstate->conn_state->pendingAreq == areq);
+
+ /* On error, report the original query, not the FETCH. */
+ if (!PQconsumeInput(fsstate->conn))
+ pgfdw_report_error(ERROR, NULL, fsstate->conn, false, fsstate->query);
+
+ fetch_more_data(node);
+
+ produce_tuple_asynchronously(areq, true);
+}
+
+/*
+ * Asynchronously produce next tuple from a foreign PostgreSQL table.
+ */
+static void
+produce_tuple_asynchronously(AsyncRequest *areq, bool fetch)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AsyncRequest *pendingAreq = fsstate->conn_state->pendingAreq;
+ TupleTableSlot *result;
+
+ /* This should not be called if the request is currently in-process */
+ Assert(areq != pendingAreq);
+
+ /* Fetch some more tuples, if we've run out */
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ /* No point in another fetch if we already detected EOF, though */
+ if (!fsstate->eof_reached)
+ {
+ /* Mark the request as pending for a callback */
+ ExecAsyncRequestPending(areq);
+ /* Begin another fetch if requested and if no pending request */
+ if (fetch && !pendingAreq)
+ fetch_more_data_begin(areq);
+ }
+ else
+ {
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ }
+ return;
+ }
+
+ /* Get a tuple from the ForeignScan node */
+ result = ExecProcNode((PlanState *) node);
+ if (!TupIsNull(result))
+ {
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ return;
+ }
+ Assert(fsstate->next_tuple >= fsstate->num_tuples);
+
+ /* Fetch some more tuples, if we've not detected EOF yet */
+ if (!fsstate->eof_reached)
+ {
+ /* Mark the request as pending for a callback */
+ ExecAsyncRequestPending(areq);
+ /* Begin another fetch if requested and if no pending request */
+ if (fetch && !pendingAreq)
+ fetch_more_data_begin(areq);
+ }
+ else
+ {
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ }
+}
+
+/*
+ * Begin an asynchronous data fetch.
+ *
+ * Note: fetch_more_data must be called to fetch the result.
+ */
+static void
+fetch_more_data_begin(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ char sql[64];
+
+ Assert(!fsstate->conn_state->pendingAreq);
+
+ /* Create the cursor synchronously. */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ /* We will send this query, but not wait for the response. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(fsstate->conn, sql))
+ pgfdw_report_error(ERROR, NULL, fsstate->conn, false, fsstate->query);
+
+ /* Remember that the request is in process */
+ fsstate->conn_state->pendingAreq = areq;
+}
+
+/*
+ * Process a pending asynchronous request.
+ */
+void
+process_pending_request(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ EState *estate = node->ss.ps.state;
+ MemoryContext oldcontext;
+
+ /* The request should be currently in-process */
+ Assert(fsstate->conn_state->pendingAreq == areq);
+ /* and would have been pending for a callback */
+ Assert(areq->callback_pending);
+
+ oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+
+ /* Unlike AsyncNotify, we unset callback_pending ourselves */
+ areq->callback_pending = false;
+
+ fetch_more_data(node);
+
+ /* We need to send a new query afterwards; don't fetch */
+ produce_tuple_asynchronously(areq, false);
+
+ /* Unlike AsyncNotify, we call ExecAsyncResponse ourselves */
+ ExecAsyncResponse(areq);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 1f67b4d9fd..88d94da6f6 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,6 +16,7 @@
#include "foreign/foreign.h"
#include "lib/stringinfo.h"
#include "libpq-fe.h"
+#include "nodes/execnodes.h"
#include "nodes/pathnodes.h"
#include "utils/relcache.h"
@@ -78,6 +79,7 @@ typedef struct PgFdwRelationInfo
Cost fdw_startup_cost;
Cost fdw_tuple_cost;
List *shippable_extensions; /* OIDs of shippable extensions */
+ bool async_capable;
/* Cached catalog information. */
ForeignTable *table;
@@ -124,17 +126,28 @@ typedef struct PgFdwRelationInfo
int relation_index;
} PgFdwRelationInfo;
+/*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+ AsyncRequest *pendingAreq; /* pending async request */
+} PgFdwConnState;
+
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
+extern void process_pending_request(AsyncRequest *areq);
/* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+ PgFdwConnState **state);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
extern PGresult *pgfdw_get_result(PGconn *conn, const char *query);
-extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query);
+extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query,
+ PgFdwConnState *state);
extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
bool clear, const char *sql);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 2b525ea44a..127e131c56 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -2928,3 +2928,198 @@ SELECT tableoid::regclass, * FROM batch_cp_upd_test;
-- Clean up
DROP TABLE batch_table, batch_cp_upd_test CASCADE;
+
+-- ===================================================================
+-- test asynchronous execution
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS (DROP extensions);
+ALTER SERVER loopback OPTIONS (ADD async_capable 'true');
+ALTER SERVER loopback2 OPTIONS (ADD async_capable 'true');
+
+CREATE TABLE async_pt (a int, b int, c text) PARTITION BY RANGE (a);
+CREATE TABLE base_tbl1 (a int, b int, c text);
+CREATE TABLE base_tbl2 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p1 PARTITION OF async_pt FOR VALUES FROM (1000) TO (2000)
+ SERVER loopback OPTIONS (table_name 'base_tbl1');
+CREATE FOREIGN TABLE async_p2 PARTITION OF async_pt FOR VALUES FROM (2000) TO (3000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl2');
+INSERT INTO async_p1 SELECT 1000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+INSERT INTO async_p2 SELECT 2000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+
+-- simple queries
+CREATE TABLE result_tbl (a int, b int, c text);
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b % 100 = 0;
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b % 100 = 0;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+-- Check case where multiple partitions use the same connection
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl3');
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+DROP FOREIGN TABLE async_p3;
+DROP TABLE base_tbl3;
+
+-- Check case where the partitioned table has local/remote partitions
+CREATE TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000);
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+-- partitionwise joins
+SET enable_partitionwise_join TO true;
+
+CREATE TABLE join_tbl (a1 int, b1 int, c1 text, a2 int, b2 int, c2 text);
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO join_tbl SELECT * FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+INSERT INTO join_tbl SELECT * FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+
+SELECT * FROM join_tbl ORDER BY a1;
+DELETE FROM join_tbl;
+
+RESET enable_partitionwise_join;
+
+-- Test interaction of async execution with plan-time partition pruning
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 3000;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 2000;
+
+-- Test interaction of async execution with run-time partition pruning
+SET plan_cache_mode TO force_generic_plan;
+
+PREPARE async_pt_query (int, int) AS
+ INSERT INTO result_tbl SELECT * FROM async_pt WHERE a < $1 AND b === $2;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (3000, 505);
+EXECUTE async_pt_query (3000, 505);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (2000, 505);
+EXECUTE async_pt_query (2000, 505);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+RESET plan_cache_mode;
+
+CREATE TABLE local_tbl(a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo'), (2505, 505, 'bar');
+ANALYZE local_tbl;
+
+CREATE INDEX base_tbl1_idx ON base_tbl1 (a);
+CREATE INDEX base_tbl2_idx ON base_tbl2 (a);
+CREATE INDEX async_p3_idx ON async_p3 (a);
+ANALYZE base_tbl1;
+ANALYZE base_tbl2;
+ANALYZE async_p3;
+
+ALTER FOREIGN TABLE async_p1 OPTIONS (use_remote_estimate 'true');
+ALTER FOREIGN TABLE async_p2 OPTIONS (use_remote_estimate 'true');
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+
+ALTER FOREIGN TABLE async_p1 OPTIONS (DROP use_remote_estimate);
+ALTER FOREIGN TABLE async_p2 OPTIONS (DROP use_remote_estimate);
+
+DROP TABLE local_tbl;
+DROP INDEX base_tbl1_idx;
+DROP INDEX base_tbl2_idx;
+DROP INDEX async_p3_idx;
+
+-- Test that pending requests are processed properly
+SET enable_mergejoin TO false;
+SET enable_hashjoin TO false;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+
+-- Check with foreign modify
+CREATE TABLE local_tbl (a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo');
+
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE remote_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl3');
+INSERT INTO remote_tbl VALUES (2505, 505, 'bar');
+
+CREATE TABLE base_tbl4 (a int, b int, c text);
+CREATE FOREIGN TABLE insert_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl4');
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+
+SELECT * FROM insert_tbl ORDER BY a;
+
+-- Check with direct modify
+EXPLAIN (VERBOSE, COSTS OFF)
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+
+SELECT * FROM join_tbl ORDER BY a1;
+DELETE FROM join_tbl;
+
+RESET enable_mergejoin;
+RESET enable_hashjoin;
+
+-- Clean up
+DROP TABLE async_pt;
+DROP TABLE base_tbl1;
+DROP TABLE base_tbl2;
+DROP TABLE result_tbl;
+DROP TABLE local_tbl;
+DROP FOREIGN TABLE remote_tbl;
+DROP FOREIGN TABLE insert_tbl;
+DROP TABLE base_tbl3;
+DROP TABLE base_tbl4;
+DROP TABLE join_tbl;
+
+ALTER SERVER loopback OPTIONS (DROP async_capable);
+ALTER SERVER loopback2 OPTIONS (DROP async_capable);
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..701cb65cc7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4787,6 +4787,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<variablelist>
+ <varlistentry id="guc-enable-async-append" xreflabel="enable_async_append">
+ <term><varname>enable_async_append</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_async_append</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of async-aware
+ append plan types. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
<term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 04bc052ee8..0808bd6c5c 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1483,6 +1483,94 @@ ShutdownForeignScan(ForeignScanState *node);
</para>
</sect2>
+ <sect2 id="fdw-callbacks-async">
+ <title>FDW Routines for Asynchronous Execution</title>
+ <para>
+ A <structname>ForeignScan</structname> node can, optionally, support
+ asynchronous execution (as described in
+ <filename>src/backend/executor/README</filename>). The following
+ functions are all optional, but are all required if asynchronous
+ execution is to be supported.
+ </para>
+
+ <para>
+<programlisting>
+bool
+IsForeignPathAsyncCapable(ForeignPath *path);
+</programlisting>
+ Test whether a given <structname>ForeignPath</structname> path can scan
+ the underlying foreign relation asynchronously.
+ This function will only be called at the end of query planning when the
+ given path is a direct child of an <structname>AppendPath</structname>
+ path and when the planner believes that asynchronous execution improves
+ performance, and should return true if the given path is able to scan the
+ foreign relation asynchronously.
+ </para>
+
+ <para>
+ If this function is not defined, it is assumed that the given path scans
+ the foreign relation using <function>IterateForeignScan</function>.
+ (This implies that the callback functions described below will never be
+ called, so they need not be provided either.)
+ </para>
+
+ <para>
+<programlisting>
+void
+ForeignAsyncRequest(AsyncRequest *areq);
+</programlisting>
+ Produce one tuple asynchronously from the
+ <structname>ForeignScan</structname> node. <literal>areq</literal> is
+ the <structname>AsyncRequest</structname> struct describing the
+ <structname>ForeignScan</structname> node and the parent
+ <structname>Append</structname> node that requested the tuple from it.
+ This function should store the tuple into the slot specified by
+ <literal>areq->result</literal>, and set
+ <literal>areq->request_complete</literal> to <literal>true</literal>;
+ or if it needs to wait on an event external to the core server such as
+ network I/O, and cannot produce any tuple immediately, set the flag to
+ <literal>false</literal>, and set
+ <literal>areq->callback_pending</literal> to <literal>true</literal>
+ for the <structname>ForeignScan</structname> node to get a callback from
+ the callback functions described below. If no more tuples are available,
+ set the slot to NULL, and the
+ <literal>areq->request_complete</literal> flag to
+ <literal>true</literal>. It's recommended to use
+ <function>ExecAsyncRequestDone</function> or
+ <function>ExecAsyncRequestPending</function> to set the output parameters.
+ </para>
+
+ <para>
+<programlisting>
+void
+ForeignAsyncConfigureWait(AsyncRequest *areq);
+</programlisting>
+ Configure a file descriptor event for which the
+ <structname>ForeignScan</structname> node wishes to wait.
+ This function will only be called when the
+ <structname>ForeignScan</structname> node has the
+ <literal>areq->callback_pending</literal> flag set, and should add
+ the event to the <structfield>as_eventset</structfield> of the parent
+ <structname>Append</structname> node specified by the given
+ <structname>AsyncRequest</structname> struct. See the comments for
+ <function>ExecAsyncConfigureWait</function> in
+ <filename>src/backend/executor/execAsync.c</filename> for additional
+ information. When the file descriptor event occurs,
+ <function>ForeignAsyncNotify</function> will be called.
+ </para>
+
+ <para>
+<programlisting>
+void
+ForeignAsyncNotify(AsyncRequest *areq);
+</programlisting>
+ Process a relevant event that has occurred, then produce one tuple
+ asynchronously from the <structname>ForeignScan</structname> node.
+ This function should set the output parameters in the same way as
+ <function>ForeignAsyncRequest</function>.
+ </para>
+ </sect2>
+
<sect2 id="fdw-callbacks-reparameterize-paths">
<title>FDW Routines for Reparameterization of Paths</title>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 43c07da20e..af540fb02f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1564,6 +1564,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AppendReady</literal></entry>
+ <entry>Waiting for subplan nodes of an <literal>Append</literal> plan
+ node to be ready.</entry>
+ </row>
<row>
<entry><literal>BackupWaitWalArchive</literal></entry>
<entry>Waiting for WAL files required for a backup to be successfully
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 07aa25799d..a1b426c50b 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -371,6 +371,34 @@ OPTIONS (ADD password_required 'false');
</sect3>
+ <sect3>
+ <title>Asynchronous Execution Options</title>
+
+ <para>
+ <filename>postgres_fdw</filename> supports asynchronous execution, which
+ runs multiple parts of an <structname>Append</structname> node
+ concurrently rather than serially to improve performance.
+ This execution can be controlled using the following option:
+ </para>
+
+ <variablelist>
+
+ <varlistentry>
+ <term><literal>async_capable</literal></term>
+ <listitem>
+ <para>
+ This option controls whether <filename>postgres_fdw</filename> allows
+ foreign tables to be scanned concurrently for asynchronous execution.
+ It can be specified for a foreign table or a foreign server.
+ A table-level option overrides a server-level option.
+ The default is <literal>false</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect3>
+
<sect3>
<title>Updatability Options</title>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index afc45429ba..fe75cabdcc 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1394,6 +1394,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1413,6 +1415,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (custom_name)
ExplainPropertyText("Custom Plan Provider", custom_name, es);
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
+ ExplainPropertyBool("Async Capable", plan->async_capable, es);
}
switch (nodeTag(plan))
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 74ac59faa1..680fd69151 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index 18b2ac1865..3726048c4a 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -359,3 +359,43 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+
+Asynchronous Execution
+----------------------
+
+In cases where a node is waiting on an event external to the database system,
+such as a ForeignScan awaiting network I/O, it's desirable for the node to
+indicate that it cannot return any tuple immediately but may be able to do so
+at a later time. A process which discovers this type of situation can always
+handle it simply by blocking, but this may waste time that could be spent
+executing some other part of the plan tree where progress could be made
+immediately. This is particularly likely to occur when the plan tree contains
+an Append node. Asynchronous execution runs multiple parts of an Append node
+concurrently rather than serially to improve performance.
+
+For asynchronous execution, an Append node must first request a tuple from an
+async-capable child node using ExecAsyncRequest. Next, it must execute the
+asynchronous event loop using ExecAppendAsyncEventWait. Eventually, when a
+child node to which an asynchronous request has been made produces a tuple,
+the Append node will receive it from the event loop via ExecAsyncResponse. In
+the current implementation of asynchronous execution, the only node type that
+requests tuples from an async-capable child node is an Append, while the only
+node type that might be async-capable is a ForeignScan.
+
+Typically, the ExecAsyncResponse callback is the only one required for nodes
+that wish to request tuples asynchronously. On the other hand, async-capable
+nodes generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+ will be invoked; it should use ExecAsyncRequestPending to indicate that the
+ request is pending for a callback described below. Alternatively, it can
+ instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events, the
+ node's ExecAsyncConfigureWait callback will be invoked to configure the
+ file descriptor event for which the node wishes to wait.
+
+3. When the file descriptor becomes ready, the node's ExecAsyncNotify callback
+ will be invoked; like #1, it should use ExecAsyncRequestPending for another
+ callback or ExecAsyncRequestDone to return a result immediately.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 4543ac79ed..58a8aa5ab7 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -531,6 +531,10 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans > 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..f1985e658c
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,124 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ */
+void
+ExecAsyncRequest(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor event
+ * for which it wishes to wait. We expect the node-type specific callback to
+ * make a single call of the following form:
+ *
+ * AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, areq);
+ */
+void
+ExecAsyncConfigureWait(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+void
+ExecAsyncNotify(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+void
+ExecAsyncResponse(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * A requestee node should call this function to deliver the tuple to its
+ * requestor node. The requestee node can call this from its ExecAsyncRequest
+ * or ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result)
+{
+ areq->request_complete = true;
+ areq->result = result;
+}
+
+/*
+ * A requestee node should call this function to indicate that it is pending
+ * for a callback. The requestee node can call this from its ExecAsyncRequest
+ * or ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestPending(AsyncRequest *areq)
+{
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ areq->result = NULL;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 15e4115bd6..5c366bc5d9 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -57,10 +57,13 @@
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
/* Shared state for parallel-aware Append. */
struct ParallelAppendState
@@ -78,12 +81,18 @@ struct ParallelAppendState
};
#define INVALID_SUBPLAN_INDEX -1
+#define EVENT_BUFFER_SIZE 16
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
static void mark_invalid_subplans_as_finished(AppendState *node);
+static void ExecAppendAsyncBegin(AppendState *node);
+static bool ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result);
+static void ExecAppendAsyncEventWait(AppendState *node);
+static void classify_matching_subplans(AppendState *node);
/* ----------------------------------------------------------------
* ExecInitAppend
@@ -102,7 +111,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
AppendState *appendstate = makeNode(AppendState);
PlanState **appendplanstates;
Bitmapset *validsubplans;
+ Bitmapset *asyncplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
@@ -119,6 +130,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_syncdone = false;
+ appendstate->as_begun = false;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -191,12 +204,25 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* While at it, find out the first valid partial plan.
*/
j = 0;
+ asyncplans = NULL;
+ nasyncplans = 0;
firstvalid = nplans;
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record async subplans. When executing EvalPlanQual, we execute
+ * async subplans synchronously; don't do this when initializing an
+ * EvalPlanQual plan tree.
+ */
+ if (initNode->async_capable && estate->es_epq_active == NULL)
+ {
+ asyncplans = bms_add_member(asyncplans, j);
+ nasyncplans++;
+ }
+
/*
* Record the lowest appendplans index which is a valid partial plan.
*/
@@ -210,6 +236,37 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* Initialize async state */
+ appendstate->as_asyncplans = asyncplans;
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_asyncrequests = NULL;
+ appendstate->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ appendstate->as_needrequest = NULL;
+ appendstate->as_eventset = NULL;
+
+ if (nasyncplans > 0)
+ {
+ appendstate->as_asyncrequests = (AsyncRequest **)
+ palloc0(nplans * sizeof(AsyncRequest *));
+
+ i = -1;
+ while ((i = bms_next_member(asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq;
+
+ areq = palloc(sizeof(AsyncRequest));
+ areq->requestor = (PlanState *) appendstate;
+ areq->requestee = appendplanstates[i];
+ areq->request_index = i;
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+
+ appendstate->as_asyncrequests[i] = areq;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -232,31 +289,59 @@ static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
+ TupleTableSlot *result;
- if (node->as_whichplan < 0)
+ /*
+ * If this is the first call after Init or ReScan, we need to do the
+ * initialization work.
+ */
+ if (!node->as_begun)
{
+ Assert(node->as_whichplan == INVALID_SUBPLAN_INDEX);
+ Assert(!node->as_syncdone);
+
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ /* If there are any async subplans, begin executing them. */
+ if (node->as_nasyncplans > 0)
+ ExecAppendAsyncBegin(node);
+
/*
- * If no subplan has been chosen, we must choose one before
+ * If no sync subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
- !node->choose_next_subplan(node))
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+
+ Assert(node->as_syncdone ||
+ (node->as_whichplan >= 0 &&
+ node->as_whichplan < node->as_nplans));
+
+ /* And we're initialized. */
+ node->as_begun = true;
}
for (;;)
{
PlanState *subnode;
- TupleTableSlot *result;
CHECK_FOR_INTERRUPTS();
/*
- * figure out which subplan we are currently processing
+ * try to get a tuple from an async subplan if any
+ */
+ if (node->as_syncdone || !bms_is_empty(node->as_needrequest))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ }
+
+ /*
+ * figure out which sync subplan we are currently processing
*/
Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
subnode = node->appendplans[node->as_whichplan];
@@ -276,8 +361,16 @@ ExecAppend(PlanState *pstate)
return result;
}
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /*
+ * wait or poll async events if any. We do this before checking for
+ * the end of iteration, because it might drain the remaining async
+ * subplans.
+ */
+ if (node->as_nasyncremain > 0)
+ ExecAppendAsyncEventWait(node);
+
+ /* choose new sync subplan; if no sync/async subplans, we're done */
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -313,6 +406,7 @@ ExecEndAppend(AppendState *node)
void
ExecReScanAppend(AppendState *node)
{
+ int nasyncplans = node->as_nasyncplans;
int i;
/*
@@ -326,6 +420,11 @@ ExecReScanAppend(AppendState *node)
{
bms_free(node->as_valid_subplans);
node->as_valid_subplans = NULL;
+ if (nasyncplans > 0)
+ {
+ bms_free(node->as_valid_asyncplans);
+ node->as_valid_asyncplans = NULL;
+ }
}
for (i = 0; i < node->as_nplans; i++)
@@ -347,8 +446,27 @@ ExecReScanAppend(AppendState *node)
ExecReScan(subnode);
}
+ /* Reset async state */
+ if (nasyncplans > 0)
+ {
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+ }
+
+ bms_free(node->as_needrequest);
+ node->as_needrequest = NULL;
+ }
+
/* Let choose_next_subplan_* function handle setting the first subplan */
node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_syncdone = false;
+ node->as_begun = false;
}
/* ----------------------------------------------------------------
@@ -429,7 +547,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
/* ----------------------------------------------------------------
* choose_next_subplan_locally
*
- * Choose next subplan for a non-parallel-aware Append,
+ * Choose next sync subplan for a non-parallel-aware Append,
* returning false if there are no more.
* ----------------------------------------------------------------
*/
@@ -442,16 +560,25 @@ choose_next_subplan_locally(AppendState *node)
/* We should never be called when there are no subplans */
Assert(node->as_nplans > 0);
+ /* Nothing to do if syncdone */
+ if (node->as_syncdone)
+ return false;
+
/*
* If first call then have the bms member function choose the first valid
- * subplan by initializing whichplan to -1. If there happen to be no
- * valid subplans then the bms member function will handle that by
- * returning a negative number which will allow us to exit returning a
+ * sync subplan by initializing whichplan to -1. If there happen to be
+ * no valid sync subplans then the bms member function will handle that
+ * by returning a negative number which will allow us to exit returning a
* false value.
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
{
- if (node->as_valid_subplans == NULL)
+ if (node->as_nasyncplans > 0)
+ {
+ /* We'd have filled as_valid_subplans already */
+ Assert(node->as_valid_subplans);
+ }
+ else if (node->as_valid_subplans == NULL)
node->as_valid_subplans =
ExecFindMatchingSubPlans(node->as_prune_state);
@@ -467,7 +594,12 @@ choose_next_subplan_locally(AppendState *node)
nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
if (nextplan < 0)
+ {
+ /* Set as_syncdone if in async mode */
+ if (node->as_nasyncplans > 0)
+ node->as_syncdone = true;
return false;
+ }
node->as_whichplan = nextplan;
@@ -709,3 +841,310 @@ mark_invalid_subplans_as_finished(AppendState *node)
node->as_pstate->pa_finished[i] = true;
}
}
+
+/* ----------------------------------------------------------------
+ * Asynchronous Append Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncBegin
+ *
+ * Begin executing designated async-capable subplans.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncBegin(AppendState *node)
+{
+ int i;
+
+ /* Backward scan is not supported by async-aware Appends. */
+ Assert(ScanDirectionIsForward(node->ps.state->es_direction));
+
+ /* We should never be called when there are no async subplans. */
+ Assert(node->as_nasyncplans > 0);
+
+ /* If we've yet to determine the valid subplans then do so now. */
+ if (node->as_valid_subplans == NULL)
+ node->as_valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ classify_matching_subplans(node);
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (node->as_nasyncremain == 0)
+ return;
+
+ /* Make a request for each of the valid async subplans. */
+ i = -1;
+ while ((i = bms_next_member(node->as_valid_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(areq->request_index == i);
+ Assert(!areq->callback_pending);
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncGetNext
+ *
+ * Get the next tuple from any of the asynchronous subplans.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
+{
+ *result = NULL;
+
+ /* We should never be called when there are no valid async subplans. */
+ Assert(node->as_nasyncremain > 0);
+
+ /* Request a tuple asynchronously. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ while (node->as_nasyncremain > 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Wait or poll async events. */
+ ExecAppendAsyncEventWait(node);
+
+ /* Request a tuple asynchronously. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ /* Break from loop if there's any sync subplan that isn't complete. */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If all sync subplans are complete, we're totally done scanning the
+ * given node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the sync subplans.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncremain == 0);
+ *result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncRequest
+ *
+ * If there are any asynchronous subplans that need a new
+ * request, make all of them.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
+{
+ Bitmapset *needrequest;
+ int i;
+
+ /* Nothing to do if there are no async subplans needing a new request. */
+ if (bms_is_empty(node->as_needrequest))
+ return false;
+
+ /*
+ * If there are any asynchronously-generated results that have not yet
+ * been returned, we have nothing to do; just return one of them.
+ */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ /* Make a new request for each of the async subplans that need it. */
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ i = -1;
+ while ((i = bms_next_member(needrequest, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(needrequest);
+
+ /* Return one of the asynchronously-generated results if any. */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncEventWait
+ *
+ * Wait or poll for file descriptor events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncEventWait(AppendState *node)
+{
+ long timeout = node->as_syncdone ? -1 : 0;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+
+ /* We should never be called when there are no valid async subplans. */
+ Assert(node->as_nasyncremain > 0);
+
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
+ node->as_nasyncplans + 1);
+ AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+ NULL, NULL);
+
+ /* Give each waiting subplan a chance to add an event. */
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ if (areq->callback_pending)
+ ExecAsyncConfigureWait(areq);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+ EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ FreeWaitEventSet(node->as_eventset);
+ node->as_eventset = NULL;
+ if (noccurred == 0)
+ return;
+
+ /* Deliver notifications. */
+ for (i = 0; i < noccurred; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ /*
+ * Each waiting subplan should have registered its wait event with
+ * user_data pointing back to its AsyncRequest.
+ */
+ if ((w->events & WL_SOCKET_READABLE) != 0)
+ {
+ AsyncRequest *areq = (AsyncRequest *) w->user_data;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ Assert(areq->callback_pending);
+ areq->callback_pending = false;
+
+ /* Do the actual work. */
+ ExecAsyncNotify(areq);
+ }
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(AsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot = areq->result;
+
+ /* The result should be a TupleTableSlot or NULL. */
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Nothing to do if the request is pending. */
+ if (!areq->request_complete)
+ {
+ /*
+ * The subplan for which the request was made would have been pending
+ * for a callback.
+ */
+ Assert(areq->callback_pending);
+ return;
+ }
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ {
+ /* The ending subplan wouldn't have been pending for a callback. */
+ Assert(!areq->callback_pending);
+ --node->as_nasyncremain;
+ return;
+ }
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresults < node->as_nasyncplans);
+ node->as_asyncresults[node->as_nasyncresults++] = slot;
+
+ /*
+ * Mark the subplan that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
+
+/* ----------------------------------------------------------------
+ * classify_matching_subplans
+ *
+ * Classify the node's as_valid_subplans into sync ones and
+ * async ones, adjust it to contain sync ones only, and save
+ * async ones in the node's as_valid_asyncplans.
+ * ----------------------------------------------------------------
+ */
+static void
+classify_matching_subplans(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+
+ Assert(node->as_valid_asyncplans == NULL);
+
+ /* Nothing to do if there are no valid subplans. */
+ if (bms_is_empty(node->as_valid_subplans))
+ {
+ node->as_syncdone = true;
+ node->as_nasyncremain = 0;
+ return;
+ }
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (!bms_overlap(node->as_valid_subplans, node->as_asyncplans))
+ {
+ node->as_nasyncremain = 0;
+ return;
+ }
+
+ /* Get valid async subplans. */
+ valid_asyncplans = bms_copy(node->as_asyncplans);
+ valid_asyncplans = bms_int_members(valid_asyncplans,
+ node->as_valid_subplans);
+
+ /* Adjust the valid subplans to contain sync subplans only. */
+ node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
+ valid_asyncplans);
+ node->as_syncdone = bms_is_empty(node->as_valid_subplans);
+
+ /* Save valid async subplans. */
+ node->as_valid_asyncplans = valid_asyncplans;
+ node->as_nasyncremain = bms_num_members(valid_asyncplans);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0969e53c3a..898890fb08 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -391,3 +391,51 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Asynchronously request a tuple from a designated async-capable node
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Callback invoked when a relevant event has occurred
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 1d0bb6e2e7..d58b79d525 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -120,6 +120,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -241,6 +242,7 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 301fa30490..ff127a19ad 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -333,6 +333,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -431,6 +432,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 377185f7c6..6a563e9903 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1615,6 +1615,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1711,6 +1712,7 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a25b674a19..f3100f7540 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -147,6 +147,7 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
+bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 906cab7053..78ef068fb7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,6 +81,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
+static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1080,6 +1081,31 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ break;
+ default:
+ break;
+ }
+ return false;
+}
+
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1097,6 +1123,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1104,6 +1131,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1167,6 +1195,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
+ /* If appropriate, consider async append */
+ consider_async = (enable_async_append && pathkeys == NIL &&
+ !best_path->path.parallel_safe &&
+ list_length(best_path->subpaths) > 1);
+
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1234,6 +1267,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (consider_async && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ ++nasyncplans;
+ }
}
/*
@@ -1266,6 +1306,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
+ plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 60f45ccc4e..4b9bcd2b41 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3995,6 +3995,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
switch (w)
{
+ case WAIT_EVENT_APPEND_READY:
+ event_name = "AppendReady";
+ break;
case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
event_name = "BackupWaitWalArchive";
break;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 43a5fded10..5f3318fa8f 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -2020,6 +2020,15 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
}
#endif
+/*
+ * Get the number of wait events registered in a given WaitEventSet.
+ */
+int
+GetNumRegisteredWaitEvents(WaitEventSet *set)
+{
+ return set->nevents;
+}
+
#if defined(WAIT_USE_POLL)
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0c5dc4d3e8..03daec9a08 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1128,6 +1128,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of async append plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_async_append,
+ true,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b234a6bfe6..791d39cf07 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -371,6 +371,7 @@
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on
+#enable_async_append = on
# - Planner Cost Constants -
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..724034f226
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(AsyncRequest *areq);
+extern void ExecAsyncConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncNotify(AsyncRequest *areq);
+extern void ExecAsyncResponse(AsyncRequest *areq);
+extern void ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result);
+extern void ExecAsyncRequestPending(AsyncRequest *areq);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index cafd410a5d..fa54ac6ad2 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -25,4 +25,6 @@ extern void ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt);
+extern void ExecAsyncAppendResponse(AsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 6ae7733e25..8ffc0ca5bf 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,4 +31,8 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(AsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncForeignScanNotify(AsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 248f78da45..7c89d081c7 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -178,6 +178,14 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+
+typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncConfigureWait_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncNotify_function) (AsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -256,6 +264,12 @@ typedef struct FdwRoutine
/* Support functions for path reparameterization. */
ReparameterizeForeignPathByChild_function ReparameterizeForeignPathByChild;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e31ad6204e..43e7f62489 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -515,6 +515,22 @@ typedef struct ResultRelInfo
struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
+/* ----------------
+ * AsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct AsyncRequest
+{
+ struct PlanState *requestor; /* Node that wants a tuple */
+ struct PlanState *requestee; /* Node from which a tuple is wanted */
+ int request_index; /* Scratch space for requestor */
+ bool callback_pending; /* Callback is needed */
+ bool request_complete; /* Request complete, result valid */
+ TupleTableSlot *result; /* Result (NULL if no more tuples) */
+} AsyncRequest;
+
/* ----------------
* EState information
*
@@ -1199,12 +1215,12 @@ typedef struct ModifyTableState
* AppendState information
*
* nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1), or a
- * special negative value. See nodeAppend.c.
+ * whichplan which synchronous plan is being executed (0 .. n-1)
+ * or a special negative value. See nodeAppend.c.
* prune_state details required to allow partitions to be
* eliminated from the scan, or NULL if not possible.
- * valid_subplans for runtime pruning, valid appendplans indexes to
- * scan.
+ * valid_subplans for runtime pruning, valid synchronous appendplans
+ * indexes to scan.
* ----------------
*/
@@ -1220,12 +1236,25 @@ struct AppendState
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
int as_whichplan;
+ bool as_begun; /* false means need to initialize */
+ Bitmapset *as_asyncplans; /* asynchronous plans indexes */
+ int as_nasyncplans; /* # of asynchronous plans */
+ AsyncRequest **as_asyncrequests; /* array of AsyncRequests */
+ TupleTableSlot **as_asyncresults; /* unreturned results of async plans */
+ int as_nasyncresults; /* # of valid entries in as_asyncresults */
+ bool as_syncdone; /* true if all synchronous plans done in
+ * asynchronous mode, else false */
+ int as_nasyncremain; /* # of remaining async plans */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ struct WaitEventSet *as_eventset; /* WaitEventSet used to configure
+ * file descriptor wait events */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_asyncplans; /* valid asynchronous plans indexes */
bool (*choose_next_subplan) (AppendState *);
};
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6e62104d0b..24ca616740 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -129,6 +129,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -245,6 +250,7 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
+ int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 1be93be098..a3fd93fe07 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -65,6 +65,7 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
+extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 87672e6f30..d699502cd9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -966,7 +966,8 @@ typedef enum
*/
typedef enum
{
- WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE = PG_WAIT_IPC,
+ WAIT_EVENT_APPEND_READY = PG_WAIT_IPC,
+ WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE,
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
WAIT_EVENT_BTREE_PAGE,
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 9e94fcaec2..44f9368c64 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -179,5 +179,6 @@ extern int WaitLatch(Latch *latch, int wakeEvents, long timeout,
extern int WaitLatchOrSocket(Latch *latch, int wakeEvents,
pgsocket sock, long timeout, uint32 wait_event_info);
extern void InitializeLatchWaitSet(void);
+extern int GetNumRegisteredWaitEvents(WaitEventSet *set);
#endif /* LATCH_H */
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 791eba8511..b89b99fb02 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -87,6 +87,7 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -136,6 +137,7 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Plan> +
<Node-Type>Seq Scan</Node-Type> +
<Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
<Relation-Name>int8_tbl</Relation-Name> +
<Alias>i8</Alias> +
<Startup-Cost>N.N</Startup-Cost> +
@@ -183,6 +185,7 @@ select explain_filter('explain (analyze, buffers, format yaml) select * from int
- Plan: +
Node Type: "Seq Scan" +
Parallel Aware: false +
+ Async Capable: false +
Relation Name: "int8_tbl"+
Alias: "i8" +
Startup Cost: N.N +
@@ -233,6 +236,7 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -346,6 +350,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Relation Name": "tenk1", +
"Parallel Aware": true, +
"Local Hit Blocks": 0, +
@@ -391,6 +396,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Sort Space Used": 0, +
"Local Hit Blocks": 0, +
@@ -433,6 +439,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Workers Planned": 0, +
"Local Hit Blocks": 0, +
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 68ca321163..a417b566d9 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -558,6 +558,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 55, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
@@ -760,6 +761,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 70, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index ff157ceb1c..499245068a 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -204,6 +204,7 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "insertconflicttest", +
"Alias": "insertconflicttest", +
"Conflict Resolution": "UPDATE", +
@@ -213,7 +214,8 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
{ +
"Node Type": "Result", +
"Parent Relationship": "Member", +
- "Parallel Aware": false +
+ "Parallel Aware": false, +
+ "Async Capable": false +
} +
] +
} +
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 6d048e309c..98dde452e6 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -95,6 +95,7 @@ select count(*) = 0 as ok from pg_stat_wal_receiver;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
@@ -113,7 +114,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--
2.19.2
On Mon, Mar 29, 2021 at 6:50 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I think the patch would be committable.
Here is a new version of the patch.
* Rebased the patch against HEAD.
* Tweaked docs/comments a bit further.
* Added the commit message. Does that make sense?
I'm happy with the patch, so I'll commit it if there are no objections.
Best regards,
Etsuro Fujita
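For anyone skimming the thread, the user-visible setup is just the new boolean option; a minimal sketch, assuming two existing postgres_fdw servers (the server and table names here are illustrative, not from the patch), mirroring the regression tests in the attached patch:

```sql
-- Hypothetical servers; async_capable can be set per server or per table.
ALTER SERVER remote1 OPTIONS (ADD async_capable 'true');
ALTER SERVER remote2 OPTIONS (ADD async_capable 'true');

-- Foreign partitions on async-capable servers may then be scanned
-- concurrently; EXPLAIN shows "Async Foreign Scan" under the Append.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM pt;  -- pt: a partitioned table with foreign partitions
```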
Attachments:
0001-Add-support-for-asynchronous-execution.patch (text/x-patch)
From a3aecaf92c00730d484640a08dfb7ad1d8aa98e5 Mon Sep 17 00:00:00 2001
From: Etsuro Fujita <efujita@postgresql.org>
Date: Tue, 30 Mar 2021 20:30:26 +0900
Subject: [PATCH] Add support for asynchronous execution.
This implements asynchronous execution, which runs multiple parts of
an Append concurrently rather than serially to improve performance
when possible.  Currently, the only node type that can be executed
concurrently is a ForeignScan that is an immediate child of an Append.
When such ForeignScans access data on different remote servers,
asynchronous execution overlaps the remote operations so that they are
performed simultaneously; queries benefit from asynchrony especially
when the remote operations are time-consuming, such as remote joins
and remote aggregations.
This may be extended in the future to more complicated node types
between a ForeignScan and an Append, such as joins or aggregates.
This also adds support for asynchronous execution to postgres_fdw,
enabled by the server-level/table-level option "async_capable".  The
default is false.
Robert Haas, Kyotaro Horiguchi, Thomas Munro, and myself. This commit
is mostly based on the patch proposed by Robert Haas, but also uses
stuff from the patch proposed by Kyotaro Horiguchi and from the patch
proposed by Thomas Munro. Reviewed by Kyotaro Horiguchi, Konstantin
Knizhnik, Andrey Lepikhov, Movead Li, Thomas Munro, Justin Pryzby, and
others.
Discussion: https://postgr.es/m/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2BhUKGLBRyu0rHrDCMC4%3DRn3252gogyp1SjOgG8SEKKZv%3DFwfQ%40mail.gmail.com
Discussion: https://postgr.es/m/20200228.170650.667613673625155850.horikyota.ntt%40gmail.com
---
contrib/postgres_fdw/connection.c | 26 +-
.../postgres_fdw/expected/postgres_fdw.out | 509 +++++++++++++++++-
contrib/postgres_fdw/option.c | 6 +-
contrib/postgres_fdw/postgres_fdw.c | 374 +++++++++++--
contrib/postgres_fdw/postgres_fdw.h | 17 +-
contrib/postgres_fdw/sql/postgres_fdw.sql | 195 +++++++
doc/src/sgml/config.sgml | 14 +
doc/src/sgml/fdwhandler.sgml | 90 ++++
doc/src/sgml/monitoring.sgml | 5 +
doc/src/sgml/postgres-fdw.sgml | 28 +
src/backend/commands/explain.c | 3 +
src/backend/executor/Makefile | 1 +
src/backend/executor/README | 40 ++
src/backend/executor/execAmi.c | 4 +
src/backend/executor/execAsync.c | 124 +++++
src/backend/executor/nodeAppend.c | 462 +++++++++++++++-
src/backend/executor/nodeForeignscan.c | 48 ++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/plan/createplan.c | 41 ++
src/backend/postmaster/pgstat.c | 3 +
src/backend/storage/ipc/latch.c | 9 +
src/backend/utils/misc/guc.c | 10 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/executor/execAsync.h | 25 +
src/include/executor/nodeAppend.h | 2 +
src/include/executor/nodeForeignscan.h | 4 +
src/include/foreign/fdwapi.h | 14 +
src/include/nodes/execnodes.h | 37 +-
src/include/nodes/plannodes.h | 6 +
src/include/optimizer/cost.h | 1 +
src/include/pgstat.h | 3 +-
src/include/storage/latch.h | 1 +
src/test/regress/expected/explain.out | 7 +
.../regress/expected/incremental_sort.out | 2 +
src/test/regress/expected/insert_conflict.out | 4 +-
src/test/regress/expected/sysviews.out | 3 +-
39 files changed, 2069 insertions(+), 57 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ee0b4acf0b..54ab8edfab 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -62,6 +62,7 @@ typedef struct ConnCacheEntry
Oid serverid; /* foreign server OID used to get server name */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ PgFdwConnState state; /* extra per-connection state */
} ConnCacheEntry;
/*
@@ -115,9 +116,12 @@ static bool disconnect_cached_connections(Oid serverid);
* will_prep_stmt must be true if caller intends to create any prepared
* statements. Since those don't go away automatically at transaction end
* (not even on error), we need this flag to cue manual cleanup.
+ *
+ * If state is not NULL, *state receives the per-connection state associated
+ * with the PGconn.
*/
PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
{
bool found;
bool retry = false;
@@ -196,6 +200,9 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
*/
PG_TRY();
{
+ /* Process a pending asynchronous request if any. */
+ if (entry->state.pendingAreq)
+ process_pending_request(entry->state.pendingAreq);
/* Start a new transaction or subtransaction if needed. */
begin_remote_xact(entry);
}
@@ -264,6 +271,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Remember if caller will prepare statements */
entry->have_prep_stmt |= will_prep_stmt;
+ /* If caller needs access to the per-connection state, return it. */
+ if (state)
+ *state = &entry->state;
+
return entry->conn;
}
@@ -291,6 +302,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ memset(&entry->state, 0, sizeof(entry->state));
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
@@ -648,8 +660,12 @@ GetPrepStmtNumber(PGconn *conn)
* Caller is responsible for the error handling on the result.
*/
PGresult *
-pgfdw_exec_query(PGconn *conn, const char *query)
+pgfdw_exec_query(PGconn *conn, const char *query, PgFdwConnState *state)
{
+ /* First, process a pending asynchronous request, if any. */
+ if (state && state->pendingAreq)
+ process_pending_request(state->pendingAreq);
+
/*
* Submit a query. Since we don't use non-blocking mode, this also can
* block. But its risk is relatively small, so we ignore that for now.
@@ -940,6 +956,8 @@ pgfdw_xact_callback(XactEvent event, void *arg)
{
entry->have_prep_stmt = false;
entry->have_error = false;
+ /* Also reset per-connection state */
+ memset(&entry->state, 0, sizeof(entry->state));
}
/* Disarm changing_xact_state if it all worked. */
@@ -1172,6 +1190,10 @@ pgfdw_reject_incomplete_xact_state_change(ConnCacheEntry *entry)
* Cancel the currently-in-progress query (whose query text we do not have)
* and ignore the result. Returns true if we successfully cancel the query
* and discard any pending result, and false if not.
+ *
+ * XXX: if the query was one sent by fetch_more_data_begin(), we could get the
+ * query text from the pendingAreq saved in the per-connection state, then
+ * report the query using it.
*/
static bool
pgfdw_cancel_query(PGconn *conn)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0649b6b81c..a285412623 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -8946,7 +8946,7 @@ DO $d$
END;
$d$;
ERROR: invalid option "password"
-HINT: Valid options in this context are: service, passfile, channel_binding, connect_timeout, dbname, host, hostaddr, port, options, application_name, keepalives, keepalives_idle, keepalives_interval, keepalives_count, tcp_user_timeout, sslmode, sslcompression, sslcert, sslkey, sslrootcert, sslcrl, sslcrldir, requirepeer, ssl_min_protocol_version, ssl_max_protocol_version, gssencmode, krbsrvname, gsslib, target_session_attrs, use_remote_estimate, fdw_startup_cost, fdw_tuple_cost, extensions, updatable, fetch_size, batch_size
+HINT: Valid options in this context are: service, passfile, channel_binding, connect_timeout, dbname, host, hostaddr, port, options, application_name, keepalives, keepalives_idle, keepalives_interval, keepalives_count, tcp_user_timeout, sslmode, sslcompression, sslcert, sslkey, sslrootcert, sslcrl, sslcrldir, requirepeer, ssl_min_protocol_version, ssl_max_protocol_version, gssencmode, krbsrvname, gsslib, target_session_attrs, use_remote_estimate, fdw_startup_cost, fdw_tuple_cost, extensions, updatable, fetch_size, batch_size, async_capable
CONTEXT: SQL statement "ALTER SERVER loopback_nopw OPTIONS (ADD password 'dummypw')"
PL/pgSQL function inline_code_block line 3 at EXECUTE
-- If we add a password for our user mapping instead, we should get a different
@@ -9437,3 +9437,510 @@ SELECT tableoid::regclass, * FROM batch_cp_upd_test;
-- Clean up
DROP TABLE batch_table, batch_cp_upd_test CASCADE;
+-- ===================================================================
+-- test asynchronous execution
+-- ===================================================================
+ALTER SERVER loopback OPTIONS (DROP extensions);
+ALTER SERVER loopback OPTIONS (ADD async_capable 'true');
+ALTER SERVER loopback2 OPTIONS (ADD async_capable 'true');
+CREATE TABLE async_pt (a int, b int, c text) PARTITION BY RANGE (a);
+CREATE TABLE base_tbl1 (a int, b int, c text);
+CREATE TABLE base_tbl2 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p1 PARTITION OF async_pt FOR VALUES FROM (1000) TO (2000)
+ SERVER loopback OPTIONS (table_name 'base_tbl1');
+CREATE FOREIGN TABLE async_p2 PARTITION OF async_pt FOR VALUES FROM (2000) TO (3000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl2');
+INSERT INTO async_p1 SELECT 1000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+INSERT INTO async_p2 SELECT 2000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+-- simple queries
+CREATE TABLE result_tbl (a int, b int, c text);
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b % 100 = 0;
+ QUERY PLAN
+----------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 100) = 0))
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (((b % 100) = 0))
+(8 rows)
+
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b % 100 = 0;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1000 | 0 | 0000
+ 1100 | 100 | 0100
+ 1200 | 200 | 0200
+ 1300 | 300 | 0300
+ 1400 | 400 | 0400
+ 1500 | 500 | 0500
+ 1600 | 600 | 0600
+ 1700 | 700 | 0700
+ 1800 | 800 | 0800
+ 1900 | 900 | 0900
+ 2000 | 0 | 0000
+ 2100 | 100 | 0100
+ 2200 | 200 | 0200
+ 2300 | 300 | 0300
+ 2400 | 400 | 0400
+ 2500 | 500 | 0500
+ 2600 | 600 | 0600
+ 2700 | 700 | 0700
+ 2800 | 800 | 0800
+ 2900 | 900 | 0900
+(20 rows)
+
+DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+(10 rows)
+
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+(2 rows)
+
+DELETE FROM result_tbl;
+-- Check case where multiple partitions use the same connection
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl3');
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Async Foreign Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (async_pt_3.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl3
+(14 rows)
+
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+ 3505 | 505 | 0505
+(3 rows)
+
+DELETE FROM result_tbl;
+DROP FOREIGN TABLE async_p3;
+DROP TABLE base_tbl3;
+-- Check case where the partitioned table has local/remote partitions
+CREATE TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000);
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (async_pt_3.b === 505)
+(13 rows)
+
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+ 3505 | 505 | 0505
+(3 rows)
+
+DELETE FROM result_tbl;
+-- partitionwise joins
+SET enable_partitionwise_join TO true;
+CREATE TABLE join_tbl (a1 int, b1 int, c1 text, a2 int, b2 int, c2 text);
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO join_tbl SELECT * FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Insert on public.join_tbl
+ -> Append
+ -> Async Foreign Scan
+ Output: t1_1.a, t1_1.b, t1_1.c, t2_1.a, t2_1.b, t2_1.c
+ Relations: (public.async_p1 t1_1) INNER JOIN (public.async_p1 t2_1)
+ Remote SQL: SELECT r5.a, r5.b, r5.c, r8.a, r8.b, r8.c FROM (public.base_tbl1 r5 INNER JOIN public.base_tbl1 r8 ON (((r5.a = r8.a)) AND ((r5.b = r8.b)) AND (((r5.b % 100) = 0))))
+ -> Async Foreign Scan
+ Output: t1_2.a, t1_2.b, t1_2.c, t2_2.a, t2_2.b, t2_2.c
+ Relations: (public.async_p2 t1_2) INNER JOIN (public.async_p2 t2_2)
+ Remote SQL: SELECT r6.a, r6.b, r6.c, r9.a, r9.b, r9.c FROM (public.base_tbl2 r6 INNER JOIN public.base_tbl2 r9 ON (((r6.a = r9.a)) AND ((r6.b = r9.b)) AND (((r6.b % 100) = 0))))
+ -> Hash Join
+ Output: t1_3.a, t1_3.b, t1_3.c, t2_3.a, t2_3.b, t2_3.c
+ Hash Cond: ((t2_3.a = t1_3.a) AND (t2_3.b = t1_3.b))
+ -> Seq Scan on public.async_p3 t2_3
+ Output: t2_3.a, t2_3.b, t2_3.c
+ -> Hash
+ Output: t1_3.a, t1_3.b, t1_3.c
+ -> Seq Scan on public.async_p3 t1_3
+ Output: t1_3.a, t1_3.b, t1_3.c
+ Filter: ((t1_3.b % 100) = 0)
+(20 rows)
+
+INSERT INTO join_tbl SELECT * FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+SELECT * FROM join_tbl ORDER BY a1;
+ a1 | b1 | c1 | a2 | b2 | c2
+------+-----+------+------+-----+------
+ 1000 | 0 | 0000 | 1000 | 0 | 0000
+ 1100 | 100 | 0100 | 1100 | 100 | 0100
+ 1200 | 200 | 0200 | 1200 | 200 | 0200
+ 1300 | 300 | 0300 | 1300 | 300 | 0300
+ 1400 | 400 | 0400 | 1400 | 400 | 0400
+ 1500 | 500 | 0500 | 1500 | 500 | 0500
+ 1600 | 600 | 0600 | 1600 | 600 | 0600
+ 1700 | 700 | 0700 | 1700 | 700 | 0700
+ 1800 | 800 | 0800 | 1800 | 800 | 0800
+ 1900 | 900 | 0900 | 1900 | 900 | 0900
+ 2000 | 0 | 0000 | 2000 | 0 | 0000
+ 2100 | 100 | 0100 | 2100 | 100 | 0100
+ 2200 | 200 | 0200 | 2200 | 200 | 0200
+ 2300 | 300 | 0300 | 2300 | 300 | 0300
+ 2400 | 400 | 0400 | 2400 | 400 | 0400
+ 2500 | 500 | 0500 | 2500 | 500 | 0500
+ 2600 | 600 | 0600 | 2600 | 600 | 0600
+ 2700 | 700 | 0700 | 2700 | 700 | 0700
+ 2800 | 800 | 0800 | 2800 | 800 | 0800
+ 2900 | 900 | 0900 | 2900 | 900 | 0900
+ 3000 | 0 | 0000 | 3000 | 0 | 0000
+ 3100 | 100 | 0100 | 3100 | 100 | 0100
+ 3200 | 200 | 0200 | 3200 | 200 | 0200
+ 3300 | 300 | 0300 | 3300 | 300 | 0300
+ 3400 | 400 | 0400 | 3400 | 400 | 0400
+ 3500 | 500 | 0500 | 3500 | 500 | 0500
+ 3600 | 600 | 0600 | 3600 | 600 | 0600
+ 3700 | 700 | 0700 | 3700 | 700 | 0700
+ 3800 | 800 | 0800 | 3800 | 800 | 0800
+ 3900 | 900 | 0900 | 3900 | 900 | 0900
+(30 rows)
+
+DELETE FROM join_tbl;
+RESET enable_partitionwise_join;
+-- Test interaction of async execution with plan-time partition pruning
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 3000;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < 3000))
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((a < 3000))
+(7 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 2000;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Foreign Scan on public.async_p1 async_pt
+ Output: async_pt.a, async_pt.b, async_pt.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < 2000))
+(3 rows)
+
+-- Test interaction of async execution with run-time partition pruning
+SET plan_cache_mode TO force_generic_plan;
+PREPARE async_pt_query (int, int) AS
+ INSERT INTO result_tbl SELECT * FROM async_pt WHERE a < $1 AND b === $2;
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (3000, 505);
+ QUERY PLAN
+------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ Subplans Removed: 1
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === $2)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < $1::integer))
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === $2)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((a < $1::integer))
+(11 rows)
+
+EXECUTE async_pt_query (3000, 505);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+(2 rows)
+
+DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (2000, 505);
+ QUERY PLAN
+------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ Subplans Removed: 2
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === $2)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < $1::integer))
+(7 rows)
+
+EXECUTE async_pt_query (2000, 505);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+------
+ 1505 | 505 | 0505
+(1 row)
+
+DELETE FROM result_tbl;
+RESET plan_cache_mode;
+CREATE TABLE local_tbl(a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo'), (2505, 505, 'bar');
+ANALYZE local_tbl;
+CREATE INDEX base_tbl1_idx ON base_tbl1 (a);
+CREATE INDEX base_tbl2_idx ON base_tbl2 (a);
+CREATE INDEX async_p3_idx ON async_p3 (a);
+ANALYZE base_tbl1;
+ANALYZE base_tbl2;
+ANALYZE async_p3;
+ALTER FOREIGN TABLE async_p1 OPTIONS (use_remote_estimate 'true');
+ALTER FOREIGN TABLE async_p2 OPTIONS (use_remote_estimate 'true');
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+ QUERY PLAN
+------------------------------------------------------------------------------------------
+ Nested Loop
+ Output: local_tbl.a, local_tbl.b, local_tbl.c, async_pt.a, async_pt.b, async_pt.c
+ -> Seq Scan on public.local_tbl
+ Output: local_tbl.a, local_tbl.b, local_tbl.c
+ Filter: (local_tbl.c = 'bar'::text)
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (($1::integer = a))
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (($1::integer = a))
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (local_tbl.a = async_pt_3.a)
+(15 rows)
+
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Nested Loop (actual rows=1 loops=1)
+ -> Seq Scan on local_tbl (actual rows=1 loops=1)
+ Filter: (c = 'bar'::text)
+ Rows Removed by Filter: 1
+ -> Append (actual rows=1 loops=1)
+ -> Async Foreign Scan on async_p1 async_pt_1 (never executed)
+ -> Async Foreign Scan on async_p2 async_pt_2 (actual rows=1 loops=1)
+ -> Seq Scan on async_p3 async_pt_3 (never executed)
+ Filter: (local_tbl.a = a)
+(9 rows)
+
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+ a | b | c | a | b | c
+------+-----+-----+------+-----+------
+ 2505 | 505 | bar | 2505 | 505 | 0505
+(1 row)
+
+ALTER FOREIGN TABLE async_p1 OPTIONS (DROP use_remote_estimate);
+ALTER FOREIGN TABLE async_p2 OPTIONS (DROP use_remote_estimate);
+DROP TABLE local_tbl;
+DROP INDEX base_tbl1_idx;
+DROP INDEX base_tbl2_idx;
+DROP INDEX async_p3_idx;
+-- Test that pending requests are processed properly
+SET enable_mergejoin TO false;
+SET enable_hashjoin TO false;
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ Output: t1.a, t1.b, t1.c, t2.a, t2.b, t2.c
+ Join Filter: (t1.a = t2.a)
+ -> Append
+ -> Async Foreign Scan on public.async_p1 t1_1
+ Output: t1_1.a, t1_1.b, t1_1.c
+ Filter: (t1_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 t1_2
+ Output: t1_2.a, t1_2.b, t1_2.c
+ Filter: (t1_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 t1_3
+ Output: t1_3.a, t1_3.b, t1_3.c
+ Filter: (t1_3.b === 505)
+ -> Materialize
+ Output: t2.a, t2.b, t2.c
+ -> Foreign Scan on public.async_p2 t2
+ Output: t2.a, t2.b, t2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+(20 rows)
+
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+ a | b | c | a | b | c
+------+-----+------+------+-----+------
+ 2505 | 505 | 0505 | 2505 | 505 | 0505
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ Output: t1.a, t1.b, t1.c
+ -> Append
+ -> Async Foreign Scan on public.async_p1 t1_1
+ Output: t1_1.a, t1_1.b, t1_1.c
+ Filter: (t1_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 t1_2
+ Output: t1_2.a, t1_2.b, t1_2.c
+ Filter: (t1_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 t1_3
+ Output: t1_3.a, t1_3.b, t1_3.c
+ Filter: (t1_3.b === 505)
+(14 rows)
+
+SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+ a | b | c
+------+-----+------
+ 3505 | 505 | 0505
+(1 row)
+
+-- Check with foreign modify
+CREATE TABLE local_tbl (a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo');
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE remote_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl3');
+INSERT INTO remote_tbl VALUES (2505, 505, 'bar');
+CREATE TABLE base_tbl4 (a int, b int, c text);
+CREATE FOREIGN TABLE insert_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl4');
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Insert on public.insert_tbl
+ Remote SQL: INSERT INTO public.base_tbl4(a, b, c) VALUES ($1, $2, $3)
+ Batch Size: 1
+ -> Append
+ -> Seq Scan on public.local_tbl
+ Output: local_tbl.a, local_tbl.b, local_tbl.c
+ -> Async Foreign Scan on public.remote_tbl
+ Output: remote_tbl.a, remote_tbl.b, remote_tbl.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl3
+(9 rows)
+
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+SELECT * FROM insert_tbl ORDER BY a;
+ a | b | c
+------+-----+-----
+ 1505 | 505 | foo
+ 2505 | 505 | bar
+(2 rows)
+
+-- Check with direct modify
+EXPLAIN (VERBOSE, COSTS OFF)
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+ QUERY PLAN
+----------------------------------------------------------------------------------------
+ Insert on public.join_tbl
+ CTE t
+ -> Update on public.remote_tbl
+ Output: remote_tbl.a, remote_tbl.b, remote_tbl.c
+ -> Foreign Update on public.remote_tbl
+ Remote SQL: UPDATE public.base_tbl3 SET c = (c || c) RETURNING a, b, c
+ -> Nested Loop Left Join
+ Output: async_pt.a, async_pt.b, async_pt.c, t.a, t.b, t.c
+ Join Filter: ((async_pt.a = t.a) AND (async_pt.b = t.b))
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Filter: (async_pt_3.b === 505)
+ -> CTE Scan on t
+ Output: t.a, t.b, t.c
+(23 rows)
+
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+SELECT * FROM join_tbl ORDER BY a1;
+ a1 | b1 | c1 | a2 | b2 | c2
+------+-----+------+------+-----+--------
+ 1505 | 505 | 0505 | | |
+ 2505 | 505 | 0505 | 2505 | 505 | barbar
+ 3505 | 505 | 0505 | | |
+(3 rows)
+
+DELETE FROM join_tbl;
+RESET enable_mergejoin;
+RESET enable_hashjoin;
+-- Clean up
+DROP TABLE async_pt;
+DROP TABLE base_tbl1;
+DROP TABLE base_tbl2;
+DROP TABLE result_tbl;
+DROP TABLE local_tbl;
+DROP FOREIGN TABLE remote_tbl;
+DROP FOREIGN TABLE insert_tbl;
+DROP TABLE base_tbl3;
+DROP TABLE base_tbl4;
+DROP TABLE join_tbl;
+ALTER SERVER loopback OPTIONS (DROP async_capable);
+ALTER SERVER loopback2 OPTIONS (DROP async_capable);
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 64698c4da3..530d7a66d4 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -107,7 +107,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
* Validate option value, when we can do so without any context.
*/
if (strcmp(def->defname, "use_remote_estimate") == 0 ||
- strcmp(def->defname, "updatable") == 0)
+ strcmp(def->defname, "updatable") == 0 ||
+ strcmp(def->defname, "async_capable") == 0)
{
/* these accept only boolean values */
(void) defGetBoolean(def);
@@ -217,6 +218,9 @@ InitPgFdwOptions(void)
/* batch_size is available on both server and table */
{"batch_size", ForeignServerRelationId, false},
{"batch_size", ForeignTableRelationId, false},
+ /* async_capable is available on both server and table */
+ {"async_capable", ForeignServerRelationId, false},
+ {"async_capable", ForeignTableRelationId, false},
{"password_required", UserMappingRelationId, false},
/*
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 20b25935ce..35c7a307c9 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -37,6 +38,7 @@
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
#include "postgres_fdw.h"
+#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/float.h"
#include "utils/guc.h"
@@ -143,6 +145,7 @@ typedef struct PgFdwScanState
/* for remote query execution */
PGconn *conn; /* connection for the scan */
+ PgFdwConnState *conn_state; /* extra per-connection state */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -159,6 +162,9 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ /* for asynchronous execution */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -176,6 +182,7 @@ typedef struct PgFdwModifyState
/* for remote query execution */
PGconn *conn; /* connection for the scan */
+ PgFdwConnState *conn_state; /* extra per-connection state */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -219,6 +226,7 @@ typedef struct PgFdwDirectModifyState
/* for remote query execution */
PGconn *conn; /* connection for the update */
+ PgFdwConnState *conn_state; /* extra per-connection state */
int numParams; /* number of parameters passed to query */
FmgrInfo *param_flinfo; /* output conversion functions for them */
List *param_exprs; /* executable expressions for param values */
@@ -408,6 +416,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *output_rel,
void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(AsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(AsyncRequest *areq);
+static void postgresForeignAsyncNotify(AsyncRequest *areq);
/*
* Helper functions
@@ -437,7 +449,8 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
void *arg);
static void create_cursor(ForeignScanState *node);
static void fetch_more_data(ForeignScanState *node);
-static void close_cursor(PGconn *conn, unsigned int cursor_number);
+static void close_cursor(PGconn *conn, unsigned int cursor_number,
+ PgFdwConnState *conn_state);
static PgFdwModifyState *create_foreign_modify(EState *estate,
RangeTblEntry *rte,
ResultRelInfo *resultRelInfo,
@@ -491,6 +504,8 @@ static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
double *totaldeadrows);
static void analyze_row_processor(PGresult *res, int row,
PgFdwAnalyzeState *astate);
+static void produce_tuple_asynchronously(AsyncRequest *areq, bool fetch);
+static void fetch_more_data_begin(AsyncRequest *areq);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -583,6 +598,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for asynchronous execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -618,14 +639,15 @@ postgresGetForeignRelSize(PlannerInfo *root,
/*
* Extract user-settable option values. Note that per-table settings of
- * use_remote_estimate and fetch_size override per-server settings of
- * them, respectively.
+ * use_remote_estimate, fetch_size and async_capable override per-server
+ * settings of them, respectively.
*/
fpinfo->use_remote_estimate = false;
fpinfo->fdw_startup_cost = DEFAULT_FDW_STARTUP_COST;
fpinfo->fdw_tuple_cost = DEFAULT_FDW_TUPLE_COST;
fpinfo->shippable_extensions = NIL;
fpinfo->fetch_size = 100;
+ fpinfo->async_capable = false;
apply_server_options(fpinfo);
apply_table_options(fpinfo);
@@ -1459,7 +1481,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
/* Assign a unique ID for my cursor */
fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1510,6 +1532,9 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
+
+ /* Set the async-capable flag */
+ fsstate->async_capable = node->ss.ps.plan->async_capable;
}
/*
@@ -1524,8 +1549,10 @@ postgresIterateForeignScan(ForeignScanState *node)
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
+ * In sync mode, if this is the first call after Begin or ReScan, we need
+ * to create the cursor on the remote side. In async mode, we would have
+ * already created the cursor before we get here, even if this is the
+ * first call after Begin or ReScan.
*/
if (!fsstate->cursor_exists)
create_cursor(node);
@@ -1535,6 +1562,9 @@ postgresIterateForeignScan(ForeignScanState *node)
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
+ /* In async mode, just clear tuple slot. */
+ if (fsstate->async_capable)
+ return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
fetch_more_data(node);
@@ -1596,7 +1626,7 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->conn, sql, fsstate->conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
PQclear(res);
@@ -1624,7 +1654,8 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->conn, fsstate->cursor_number,
+ fsstate->conn_state);
/* Release remote connection */
ReleaseConnection(fsstate->conn);
@@ -2501,7 +2532,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->conn = GetConnection(user, false, &dmstate->conn_state);
/* Update the foreign-join-related fields. */
if (fsplan->scan.scanrelid == 0)
@@ -2882,7 +2913,7 @@ estimate_path_cost_size(PlannerInfo *root,
false, &retrieved_attrs, NULL);
/* Get the remote estimate */
- conn = GetConnection(fpinfo->user, false);
+ conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -3328,7 +3359,7 @@ get_remote_estimate(const char *sql, PGconn *conn,
/*
* Execute EXPLAIN remotely.
*/
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_exec_query(conn, sql, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql);
@@ -3452,6 +3483,10 @@ create_cursor(ForeignScanState *node)
StringInfoData buf;
PGresult *res;
+ /* First, process a pending asynchronous request, if any. */
+ if (fsstate->conn_state->pendingAreq)
+ process_pending_request(fsstate->conn_state->pendingAreq);
+
/*
* Construct array of query parameter values in text format. We do the
* conversions in the short-lived per-tuple context, so as not to cause a
@@ -3532,17 +3567,38 @@ fetch_more_data(ForeignScanState *node)
PG_TRY();
{
PGconn *conn = fsstate->conn;
- char sql[64];
int numrows;
int i;
- snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
- fsstate->fetch_size, fsstate->cursor_number);
+ if (fsstate->async_capable)
+ {
+ Assert(fsstate->conn_state->pendingAreq);
- res = pgfdw_exec_query(conn, sql);
- /* On error, report the original query, not the FETCH. */
- if (PQresultStatus(res) != PGRES_TUPLES_OK)
- pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ /*
+ * The query was already sent by an earlier call to
+ * fetch_more_data_begin. So now we just fetch the result.
+ */
+ res = pgfdw_get_result(conn, fsstate->query);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+
+ /* Reset per-connection state */
+ fsstate->conn_state->pendingAreq = NULL;
+ }
+ else
+ {
+ char sql[64];
+
+ /* This is a regular synchronous fetch. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ res = pgfdw_exec_query(conn, sql, fsstate->conn_state);
+ /* On error, report the original query, not the FETCH. */
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+ }
/* Convert the data into HeapTuples */
numrows = PQntuples(res);
@@ -3634,7 +3690,8 @@ reset_transmission_modes(int nestlevel)
* Utility routine to close a cursor.
*/
static void
-close_cursor(PGconn *conn, unsigned int cursor_number)
+close_cursor(PGconn *conn, unsigned int cursor_number,
+ PgFdwConnState *conn_state)
{
char sql[64];
PGresult *res;
@@ -3645,7 +3702,7 @@ close_cursor(PGconn *conn, unsigned int cursor_number)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_exec_query(conn, sql, conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, true, sql);
PQclear(res);
@@ -3694,7 +3751,7 @@ create_foreign_modify(EState *estate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->conn = GetConnection(user, true, &fmstate->conn_state);
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Set up remote query information. */
@@ -3793,6 +3850,10 @@ execute_foreign_modify(EState *estate,
operation == CMD_UPDATE ||
operation == CMD_DELETE);
+ /* First, process a pending asynchronous request, if any. */
+ if (fmstate->conn_state->pendingAreq)
+ process_pending_request(fmstate->conn_state->pendingAreq);
+
/*
* If the existing query was deparsed and prepared for a different number
* of rows, rebuild it for the proper number.
@@ -3894,6 +3955,11 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
char *p_name;
PGresult *res;
+ /*
+ * The caller would already have processed a pending asynchronous request
+ * if any, so no need to do it here.
+ */
+
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
GetPrepStmtNumber(fmstate->conn));
@@ -4079,7 +4145,7 @@ deallocate_query(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->conn, sql, fmstate->conn_state);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
PQclear(res);
@@ -4227,6 +4293,10 @@ execute_dml_stmt(ForeignScanState *node)
int numParams = dmstate->numParams;
const char **values = dmstate->param_values;
+ /* First, process a pending asynchronous request, if any. */
+ if (dmstate->conn_state->pendingAreq)
+ process_pending_request(dmstate->conn_state->pendingAreq);
+
/*
* Construct array of query parameter values in text format.
*/
@@ -4628,7 +4698,7 @@ postgresAnalyzeForeignTable(Relation relation,
*/
table = GetForeignTable(RelationGetRelid(relation));
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct command to get page count for relation.
@@ -4639,7 +4709,7 @@ postgresAnalyzeForeignTable(Relation relation,
/* In what follows, do not risk leaking any PGresults. */
PG_TRY();
{
- res = pgfdw_exec_query(conn, sql.data);
+ res = pgfdw_exec_query(conn, sql.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -4714,7 +4784,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
table = GetForeignTable(RelationGetRelid(relation));
server = GetForeignServer(table->serverid);
user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
- conn = GetConnection(user, false);
+ conn = GetConnection(user, false, NULL);
/*
* Construct cursor that retrieves whole rows from remote.
@@ -4731,7 +4801,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
int fetch_size;
ListCell *lc;
- res = pgfdw_exec_query(conn, sql.data);
+ res = pgfdw_exec_query(conn, sql.data, NULL);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
PQclear(res);
@@ -4783,7 +4853,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
*/
/* Fetch some rows */
- res = pgfdw_exec_query(conn, fetch_sql);
+ res = pgfdw_exec_query(conn, fetch_sql, NULL);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, sql.data);
@@ -4802,7 +4872,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
}
/* Close the cursor, just to be tidy. */
- close_cursor(conn, cursor_number);
+ close_cursor(conn, cursor_number, NULL);
}
PG_CATCH();
{
@@ -4942,7 +5012,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
*/
server = GetForeignServer(serverOid);
mapping = GetUserMapping(GetUserId(), server->serverid);
- conn = GetConnection(mapping, false);
+ conn = GetConnection(mapping, false, NULL);
/* Don't attempt to import collation if remote server hasn't got it */
if (PQserverVersion(conn) < 90100)
@@ -4958,7 +5028,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, "SELECT 1 FROM pg_catalog.pg_namespace WHERE nspname = ");
deparseStringLiteral(&buf, stmt->remote_schema);
- res = pgfdw_exec_query(conn, buf.data);
+ res = pgfdw_exec_query(conn, buf.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -5070,7 +5140,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
appendStringInfoString(&buf, " ORDER BY c.relname, a.attnum");
/* Fetch the data */
- res = pgfdw_exec_query(conn, buf.data);
+ res = pgfdw_exec_query(conn, buf.data, NULL);
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, buf.data);
@@ -5530,6 +5600,8 @@ apply_server_options(PgFdwRelationInfo *fpinfo)
ExtractExtensionList(defGetString(def), false);
else if (strcmp(def->defname, "fetch_size") == 0)
fpinfo->fetch_size = strtol(defGetString(def), NULL, 10);
+ else if (strcmp(def->defname, "async_capable") == 0)
+ fpinfo->async_capable = defGetBoolean(def);
}
}
@@ -5551,6 +5623,8 @@ apply_table_options(PgFdwRelationInfo *fpinfo)
fpinfo->use_remote_estimate = defGetBoolean(def);
else if (strcmp(def->defname, "fetch_size") == 0)
fpinfo->fetch_size = strtol(defGetString(def), NULL, 10);
+ else if (strcmp(def->defname, "async_capable") == 0)
+ fpinfo->async_capable = defGetBoolean(def);
}
}
@@ -5585,6 +5659,7 @@ merge_fdw_options(PgFdwRelationInfo *fpinfo,
fpinfo->shippable_extensions = fpinfo_o->shippable_extensions;
fpinfo->use_remote_estimate = fpinfo_o->use_remote_estimate;
fpinfo->fetch_size = fpinfo_o->fetch_size;
+ fpinfo->async_capable = fpinfo_o->async_capable;
/* Merge the table level options from either side of the join. */
if (fpinfo_i)
@@ -5606,6 +5681,13 @@ merge_fdw_options(PgFdwRelationInfo *fpinfo,
* relation sizes.
*/
fpinfo->fetch_size = Max(fpinfo_o->fetch_size, fpinfo_i->fetch_size);
+
+ /*
+ * We'll prefer to consider this join async-capable if any table from
+ * either side of the join is considered async-capable.
+ */
+ fpinfo->async_capable = fpinfo_o->async_capable ||
+ fpinfo_i->async_capable;
}
}
@@ -6489,6 +6571,236 @@ add_foreign_final_paths(PlannerInfo *root, RelOptInfo *input_rel,
add_path(final_rel, (Path *) final_path);
}
+/*
+ * postgresIsForeignPathAsyncCapable
+ * Check whether a given ForeignPath node is async-capable.
+ */
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ RelOptInfo *rel = ((Path *) path)->parent;
+ PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) rel->fdw_private;
+
+ return fpinfo->async_capable;
+}
+
+/*
+ * postgresForeignAsyncRequest
+ * Asynchronously request next tuple from a foreign PostgreSQL table.
+ */
+static void
+postgresForeignAsyncRequest(AsyncRequest *areq)
+{
+ produce_tuple_asynchronously(areq, true);
+}
+
+/*
+ * postgresForeignAsyncConfigureWait
+ * Configure a file descriptor event for which we wish to wait.
+ */
+static void
+postgresForeignAsyncConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AsyncRequest *pendingAreq = fsstate->conn_state->pendingAreq;
+ AppendState *requestor = (AppendState *) areq->requestor;
+ WaitEventSet *set = requestor->as_eventset;
+
+ /* This should not be called unless callback_pending */
+ Assert(areq->callback_pending);
+
+ /* The core code would have registered postmaster death event */
+ Assert(GetNumRegisteredWaitEvents(set) >= 1);
+
+ /* Begin an asynchronous data fetch if necessary */
+ if (!pendingAreq)
+ fetch_more_data_begin(areq);
+ else if (pendingAreq->requestor != areq->requestor)
+ {
+ /*
+ * This is the case when the in-process request was made by another
+ * Append. Note that it might be useless to process the request,
+ * because the query might not need tuples from that Append anymore.
+ * Skip the given request if there are any configured events other
+ * than the postmaster death event; otherwise process the request,
+ * then begin a fetch to configure the event below, because otherwise
+ * we might end up with no configured events other than the postmaster
+ * death event.
+ */
+ if (GetNumRegisteredWaitEvents(set) > 1)
+ return;
+ process_pending_request(pendingAreq);
+ fetch_more_data_begin(areq);
+ }
+ else if (pendingAreq->requestee != areq->requestee)
+ {
+ /*
+ * This is the case when the in-process request was made by the same
+ * parent but for a different child. Since we configure only the
+ * event for the request made for that child, skip the given request.
+ */
+ return;
+ }
+ else
+ Assert(pendingAreq == areq);
+
+ AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
+ NULL, areq);
+}
+
+/*
+ * postgresForeignAsyncNotify
+ * Fetch some more tuples from a file descriptor that becomes ready,
+ * requesting next tuple.
+ */
+static void
+postgresForeignAsyncNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+
+ /* The request should be currently in-process */
+ Assert(fsstate->conn_state->pendingAreq == areq);
+
+ /* The core code would have initialized the callback_pending flag */
+ Assert(!areq->callback_pending);
+
+ /* On error, report the original query, not the FETCH. */
+ if (!PQconsumeInput(fsstate->conn))
+ pgfdw_report_error(ERROR, NULL, fsstate->conn, false, fsstate->query);
+
+ fetch_more_data(node);
+
+ produce_tuple_asynchronously(areq, true);
+}
+
+/*
+ * Asynchronously produce next tuple from a foreign PostgreSQL table.
+ */
+static void
+produce_tuple_asynchronously(AsyncRequest *areq, bool fetch)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ AsyncRequest *pendingAreq = fsstate->conn_state->pendingAreq;
+ TupleTableSlot *result;
+
+ /* This should not be called if the request is currently in-process */
+ Assert(areq != pendingAreq);
+
+ /* Fetch some more tuples, if we've run out */
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ /* No point in another fetch if we already detected EOF, though */
+ if (!fsstate->eof_reached)
+ {
+ /* Mark the request as pending for a callback */
+ ExecAsyncRequestPending(areq);
+ /* Begin another fetch if requested and if no pending request */
+ if (fetch && !pendingAreq)
+ fetch_more_data_begin(areq);
+ }
+ else
+ {
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ }
+ return;
+ }
+
+ /* Get a tuple from the ForeignScan node */
+ result = ExecProcNode((PlanState *) node);
+ if (!TupIsNull(result))
+ {
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ return;
+ }
+ Assert(fsstate->next_tuple >= fsstate->num_tuples);
+
+ /* Fetch some more tuples, if we've not detected EOF yet */
+ if (!fsstate->eof_reached)
+ {
+ /* Mark the request as pending for a callback */
+ ExecAsyncRequestPending(areq);
+ /* Begin another fetch if requested and if no pending request */
+ if (fetch && !pendingAreq)
+ fetch_more_data_begin(areq);
+ }
+ else
+ {
+ /* There's nothing more to do; just return a NULL pointer */
+ result = NULL;
+ /* Mark the request as complete */
+ ExecAsyncRequestDone(areq, result);
+ }
+}
+
+/*
+ * Begin an asynchronous data fetch.
+ *
+ * Note: fetch_more_data must be called to fetch the result.
+ */
+static void
+fetch_more_data_begin(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ char sql[64];
+
+ Assert(!fsstate->conn_state->pendingAreq);
+
+ /* Create the cursor synchronously. */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ /* We will send this query, but not wait for the response. */
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(fsstate->conn, sql))
+ pgfdw_report_error(ERROR, NULL, fsstate->conn, false, fsstate->query);
+
+ /* Remember that the request is in process */
+ fsstate->conn_state->pendingAreq = areq;
+}
+
+/*
+ * Process a pending asynchronous request.
+ */
+void
+process_pending_request(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ EState *estate = node->ss.ps.state;
+ MemoryContext oldcontext;
+
+ /* The request should be currently in-process */
+ Assert(fsstate->conn_state->pendingAreq == areq);
+
+ oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+
+ /* The request would have been pending for a callback */
+ Assert(areq->callback_pending);
+
+ /* Unlike AsyncNotify, we unset callback_pending ourselves */
+ areq->callback_pending = false;
+
+ fetch_more_data(node);
+
+ /* We need to send a new query afterwards; don't fetch */
+ produce_tuple_asynchronously(areq, false);
+
+ /* Unlike AsyncNotify, we call ExecAsyncResponse ourselves */
+ ExecAsyncResponse(areq);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 1f67b4d9fd..88d94da6f6 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -16,6 +16,7 @@
#include "foreign/foreign.h"
#include "lib/stringinfo.h"
#include "libpq-fe.h"
+#include "nodes/execnodes.h"
#include "nodes/pathnodes.h"
#include "utils/relcache.h"
@@ -78,6 +79,7 @@ typedef struct PgFdwRelationInfo
Cost fdw_startup_cost;
Cost fdw_tuple_cost;
List *shippable_extensions; /* OIDs of shippable extensions */
+ bool async_capable;
/* Cached catalog information. */
ForeignTable *table;
@@ -124,17 +126,28 @@ typedef struct PgFdwRelationInfo
int relation_index;
} PgFdwRelationInfo;
+/*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+ AsyncRequest *pendingAreq; /* pending async request */
+} PgFdwConnState;
+
/* in postgres_fdw.c */
extern int set_transmission_modes(void);
extern void reset_transmission_modes(int nestlevel);
+extern void process_pending_request(AsyncRequest *areq);
/* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+ PgFdwConnState **state);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
extern PGresult *pgfdw_get_result(PGconn *conn, const char *query);
-extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query);
+extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query,
+ PgFdwConnState *state);
extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
bool clear, const char *sql);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 2b525ea44a..127e131c56 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -2928,3 +2928,198 @@ SELECT tableoid::regclass, * FROM batch_cp_upd_test;
-- Clean up
DROP TABLE batch_table, batch_cp_upd_test CASCADE;
+
+-- ===================================================================
+-- test asynchronous execution
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS (DROP extensions);
+ALTER SERVER loopback OPTIONS (ADD async_capable 'true');
+ALTER SERVER loopback2 OPTIONS (ADD async_capable 'true');
+
+CREATE TABLE async_pt (a int, b int, c text) PARTITION BY RANGE (a);
+CREATE TABLE base_tbl1 (a int, b int, c text);
+CREATE TABLE base_tbl2 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p1 PARTITION OF async_pt FOR VALUES FROM (1000) TO (2000)
+ SERVER loopback OPTIONS (table_name 'base_tbl1');
+CREATE FOREIGN TABLE async_p2 PARTITION OF async_pt FOR VALUES FROM (2000) TO (3000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl2');
+INSERT INTO async_p1 SELECT 1000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+INSERT INTO async_p2 SELECT 2000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+
+-- simple queries
+CREATE TABLE result_tbl (a int, b int, c text);
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b % 100 = 0;
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b % 100 = 0;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+-- Check case where multiple partitions use the same connection
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000)
+ SERVER loopback2 OPTIONS (table_name 'base_tbl3');
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+DROP FOREIGN TABLE async_p3;
+DROP TABLE base_tbl3;
+
+-- Check case where the partitioned table has local/remote partitions
+CREATE TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000);
+INSERT INTO async_p3 SELECT 3000 + i, i, to_char(i, 'FM0000') FROM generate_series(0, 999, 5) i;
+ANALYZE async_pt;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+-- partitionwise joins
+SET enable_partitionwise_join TO true;
+
+CREATE TABLE join_tbl (a1 int, b1 int, c1 text, a2 int, b2 int, c2 text);
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO join_tbl SELECT * FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+INSERT INTO join_tbl SELECT * FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+
+SELECT * FROM join_tbl ORDER BY a1;
+DELETE FROM join_tbl;
+
+RESET enable_partitionwise_join;
+
+-- Test interaction of async execution with plan-time partition pruning
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 3000;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE a < 2000;
+
+-- Test interaction of async execution with run-time partition pruning
+SET plan_cache_mode TO force_generic_plan;
+
+PREPARE async_pt_query (int, int) AS
+ INSERT INTO result_tbl SELECT * FROM async_pt WHERE a < $1 AND b === $2;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (3000, 505);
+EXECUTE async_pt_query (3000, 505);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+EXECUTE async_pt_query (2000, 505);
+EXECUTE async_pt_query (2000, 505);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+RESET plan_cache_mode;
+
+CREATE TABLE local_tbl(a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo'), (2505, 505, 'bar');
+ANALYZE local_tbl;
+
+CREATE INDEX base_tbl1_idx ON base_tbl1 (a);
+CREATE INDEX base_tbl2_idx ON base_tbl2 (a);
+CREATE INDEX async_p3_idx ON async_p3 (a);
+ANALYZE base_tbl1;
+ANALYZE base_tbl2;
+ANALYZE async_p3;
+
+ALTER FOREIGN TABLE async_p1 OPTIONS (use_remote_estimate 'true');
+ALTER FOREIGN TABLE async_p2 OPTIONS (use_remote_estimate 'true');
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+
+ALTER FOREIGN TABLE async_p1 OPTIONS (DROP use_remote_estimate);
+ALTER FOREIGN TABLE async_p2 OPTIONS (DROP use_remote_estimate);
+
+DROP TABLE local_tbl;
+DROP INDEX base_tbl1_idx;
+DROP INDEX base_tbl2_idx;
+DROP INDEX async_p3_idx;
+
+-- Test that pending requests are processed properly
+SET enable_mergejoin TO false;
+SET enable_hashjoin TO false;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+
+-- Check with foreign modify
+CREATE TABLE local_tbl (a int, b int, c text);
+INSERT INTO local_tbl VALUES (1505, 505, 'foo');
+
+CREATE TABLE base_tbl3 (a int, b int, c text);
+CREATE FOREIGN TABLE remote_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl3');
+INSERT INTO remote_tbl VALUES (2505, 505, 'bar');
+
+CREATE TABLE base_tbl4 (a int, b int, c text);
+CREATE FOREIGN TABLE insert_tbl (a int, b int, c text)
+ SERVER loopback OPTIONS (table_name 'base_tbl4');
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+INSERT INTO insert_tbl (SELECT * FROM local_tbl UNION ALL SELECT * FROM remote_tbl);
+
+SELECT * FROM insert_tbl ORDER BY a;
+
+-- Check with direct modify
+EXPLAIN (VERBOSE, COSTS OFF)
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+WITH t AS (UPDATE remote_tbl SET c = c || c RETURNING *)
+INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND async_pt.b = t.b) WHERE async_pt.b === 505;
+
+SELECT * FROM join_tbl ORDER BY a1;
+DELETE FROM join_tbl;
+
+RESET enable_mergejoin;
+RESET enable_hashjoin;
+
+-- Clean up
+DROP TABLE async_pt;
+DROP TABLE base_tbl1;
+DROP TABLE base_tbl2;
+DROP TABLE result_tbl;
+DROP TABLE local_tbl;
+DROP FOREIGN TABLE remote_tbl;
+DROP FOREIGN TABLE insert_tbl;
+DROP TABLE base_tbl3;
+DROP TABLE base_tbl4;
+DROP TABLE join_tbl;
+
+ALTER SERVER loopback OPTIONS (DROP async_capable);
+ALTER SERVER loopback2 OPTIONS (DROP async_capable);
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..701cb65cc7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4787,6 +4787,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<variablelist>
+ <varlistentry id="guc-enable-async-append" xreflabel="enable_async_append">
+ <term><varname>enable_async_append</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_async_append</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of async-aware
+ append plan types. The default is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
<term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 04bc052ee8..635c9ec559 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1483,6 +1483,96 @@ ShutdownForeignScan(ForeignScanState *node);
</para>
</sect2>
+ <sect2 id="fdw-callbacks-async">
+ <title>FDW Routines for Asynchronous Execution</title>
+ <para>
+ A <structname>ForeignScan</structname> node can, optionally, support
+ asynchronous execution as described in
+ <filename>src/backend/executor/README</filename>. The following
+ functions are all optional, but are all required if asynchronous
+ execution is to be supported.
+ </para>
+
+ <para>
+<programlisting>
+bool
+IsForeignPathAsyncCapable(ForeignPath *path);
+</programlisting>
+ Test whether a given <structname>ForeignPath</structname> path can scan
+ the underlying foreign relation asynchronously.
+ This function is only called at the end of query planning, when the
+ given path is a direct child of an <structname>AppendPath</structname>
+ path and the planner believes that asynchronous execution would improve
+ performance. It should return true if the given path is able to scan
+ the foreign relation asynchronously.
+ </para>
+
+ <para>
+ If this function is not defined, it is assumed that the given path scans
+ the foreign relation using <function>IterateForeignScan</function>.
+ (This implies that the callback functions described below will never be
+ called, so they need not be provided either.)
+ </para>
+
+ <para>
+<programlisting>
+void
+ForeignAsyncRequest(AsyncRequest *areq);
+</programlisting>
+ Produce one tuple asynchronously from the
+ <structname>ForeignScan</structname> node. <literal>areq</literal> is
+ the <structname>AsyncRequest</structname> struct describing the
+ <structname>ForeignScan</structname> node and the parent
+ <structname>Append</structname> node that requested the tuple from it.
+ This function should store the tuple into the slot specified by
+ <literal>areq->result</literal>, and set
+ <literal>areq->request_complete</literal> to <literal>true</literal>;
+ or if it needs to wait on an event external to the core server such as
+ network I/O, and cannot produce any tuple immediately, set the flag to
+ <literal>false</literal>, and set
+ <literal>areq->callback_pending</literal> to <literal>true</literal>
+ for the <structname>ForeignScan</structname> node to get a callback from
+ the callback functions described below. If no more tuples are available,
+ set the slot to NULL, and the
+ <literal>areq->request_complete</literal> flag to
+ <literal>true</literal>. It's recommended to use
+ <function>ExecAsyncRequestDone</function> or
+ <function>ExecAsyncRequestPending</function> to set the output parameters
+ in the <literal>areq</literal>.
+ </para>
+
+ <para>
+<programlisting>
+void
+ForeignAsyncConfigureWait(AsyncRequest *areq);
+</programlisting>
+ Configure a file descriptor event for which the
+ <structname>ForeignScan</structname> node wishes to wait.
+ This function will only be called when the
+ <structname>ForeignScan</structname> node has the
+ <literal>areq->callback_pending</literal> flag set, and should add
+ the event to the <structfield>as_eventset</structfield> of the parent
+ <structname>Append</structname> node described by the
+ <literal>areq</literal>. See the comments for
+ <function>ExecAsyncConfigureWait</function> in
+ <filename>src/backend/executor/execAsync.c</filename> for additional
+ information. When the file descriptor event occurs,
+ <function>ForeignAsyncNotify</function> will be called.
+ </para>
+
+ <para>
+<programlisting>
+void
+ForeignAsyncNotify(AsyncRequest *areq);
+</programlisting>
+ Process a relevant event that has occurred, then produce one tuple
+ asynchronously from the <structname>ForeignScan</structname> node.
+ This function should set the output parameters in the
+ <literal>areq</literal> in the same way as
+ <function>ForeignAsyncRequest</function>.
+ </para>
+ </sect2>
+
<sect2 id="fdw-callbacks-reparameterize-paths">
<title>FDW Routines for Reparameterization of Paths</title>
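The all-or-none contract among the four callbacks documented above can be sketched with a toy routine table. (This is a hypothetical struct for illustration; the real one is <structname>FdwRoutine</structname>, and the core code checks the pointers individually rather than through a helper like this.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy routine table mirroring the four optional async callbacks above
 * (hypothetical struct; not the real FdwRoutine from fdwapi.h). */
typedef struct ToyFdwRoutine
{
	bool		(*IsForeignPathAsyncCapable) (void *path);
	void		(*ForeignAsyncRequest) (void *areq);
	void		(*ForeignAsyncConfigureWait) (void *areq);
	void		(*ForeignAsyncNotify) (void *areq);
} ToyFdwRoutine;

/* Each callback is individually optional, but all four must be present
 * for asynchronous execution to be usable at all. */
static bool
toy_supports_async(const ToyFdwRoutine *r)
{
	return r->IsForeignPathAsyncCapable != NULL &&
		r->ForeignAsyncRequest != NULL &&
		r->ForeignAsyncConfigureWait != NULL &&
		r->ForeignAsyncNotify != NULL;
}
```

A routine table supplying only some of the callbacks would behave as if none were supplied, per the text above.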
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 43c07da20e..af540fb02f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1564,6 +1564,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AppendReady</literal></entry>
+ <entry>Waiting for subplan nodes of an <literal>Append</literal> plan
+ node to be ready.</entry>
+ </row>
<row>
<entry><literal>BackupWaitWalArchive</literal></entry>
<entry>Waiting for WAL files required for a backup to be successfully
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 07aa25799d..a1b426c50b 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -371,6 +371,34 @@ OPTIONS (ADD password_required 'false');
</sect3>
+ <sect3>
+ <title>Asynchronous Execution Options</title>
+
+ <para>
+ <filename>postgres_fdw</filename> supports asynchronous execution, which
+ runs multiple parts of an <structname>Append</structname> node
+ concurrently rather than serially to improve performance.
+ This execution can be controlled using the following option:
+ </para>
+
+ <variablelist>
+
+ <varlistentry>
+ <term><literal>async_capable</literal></term>
+ <listitem>
+ <para>
+ This option controls whether <filename>postgres_fdw</filename> allows
+ foreign tables to be scanned concurrently for asynchronous execution.
+ It can be specified for a foreign table or a foreign server.
+ A table-level option overrides a server-level option.
+ The default is <literal>false</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect3>
+
<sect3>
<title>Updatability Options</title>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index afc45429ba..fe75cabdcc 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1394,6 +1394,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
+ if (plan->async_capable)
+ appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
}
@@ -1413,6 +1415,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (custom_name)
ExplainPropertyText("Custom Plan Provider", custom_name, es);
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
+ ExplainPropertyBool("Async Capable", plan->async_capable, es);
}
switch (nodeTag(plan))
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 74ac59faa1..680fd69151 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
+ execAsync.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index 18b2ac1865..3726048c4a 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -359,3 +359,43 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+
+Asynchronous Execution
+----------------------
+
+In cases where a node is waiting on an event external to the database system,
+such as a ForeignScan awaiting network I/O, it's desirable for the node to
+indicate that it cannot return any tuple immediately but may be able to do so
+at a later time. A process which discovers this type of situation can always
+handle it simply by blocking, but this may waste time that could be spent
+executing some other part of the plan tree where progress could be made
+immediately. This is particularly likely to occur when the plan tree contains
+an Append node. Asynchronous execution runs multiple parts of an Append node
+concurrently rather than serially to improve performance.
+
+For asynchronous execution, an Append node must first request a tuple from an
+async-capable child node using ExecAsyncRequest. Next, it must execute the
+asynchronous event loop using ExecAppendAsyncEventWait. Eventually, when a
+child node to which an asynchronous request has been made produces a tuple,
+the Append node will receive it from the event loop via ExecAsyncResponse. In
+the current implementation of asynchronous execution, the only node type that
+requests tuples from an async-capable child node is an Append, while the only
+node type that might be async-capable is a ForeignScan.
+
+Typically, the ExecAsyncResponse callback is the only one required for nodes
+that wish to request tuples asynchronously. On the other hand, async-capable
+nodes generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+ will be invoked; it should use ExecAsyncRequestPending to indicate that the
+ request is pending for a callback described below. Alternatively, it can
+ instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events, the
+ node's ExecAsyncConfigureWait callback will be invoked to configure the
+ file descriptor event for which the node wishes to wait.
+
+3. When the file descriptor becomes ready, the node's ExecAsyncNotify callback
+ will be invoked; like #1, it should use ExecAsyncRequestPending for another
+ callback or ExecAsyncRequestDone to return a result immediately.
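The three-step handshake the README describes can be modeled minimally as follows. These are toy structs and functions whose field names merely follow the README; they are not the executor's real AsyncRequest machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal model of the AsyncRequest handshake described above (field
 * names follow the README; this is not the executor's real struct). */
typedef struct ToyAsyncRequest
{
	bool		callback_pending;	/* requestee waits on an fd event */
	bool		request_complete;	/* a result (possibly NULL) is ready */
	int		   *result;				/* stand-in for the tuple slot */
} ToyAsyncRequest;

/* Analogous to ExecAsyncRequestDone: deliver a result right away. */
static void
toy_request_done(ToyAsyncRequest *areq, int *result)
{
	areq->request_complete = true;
	areq->result = result;
}

/* Analogous to ExecAsyncRequestPending: defer until an event fires. */
static void
toy_request_pending(ToyAsyncRequest *areq)
{
	areq->callback_pending = true;
	areq->request_complete = false;
	areq->result = NULL;
}
```

A request thus moves from pending (step 1) through the event loop (step 2) to done (step 3), at which point the requestor consumes the result.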
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 4543ac79ed..58a8aa5ab7 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -531,6 +531,10 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans > 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..f1985e658c
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,124 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ */
+void
+ExecAsyncRequest(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor event
+ * for which it wishes to wait. We expect the node-type specific callback to
+ * make a single call of the following form:
+ *
+ * AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, areq);
+ */
+void
+ExecAsyncConfigureWait(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+void
+ExecAsyncNotify(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ ExecAsyncResponse(areq);
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+void
+ExecAsyncResponse(AsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(areq);
+ break;
+ default:
+ /* If the node doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * A requestee node should call this function to deliver the tuple to its
+ * requestor node. The requestee node can call this from its ExecAsyncRequest
+ * or ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result)
+{
+ areq->request_complete = true;
+ areq->result = result;
+}
+
+/*
+ * A requestee node should call this function to indicate that it is pending
+ * for a callback. The requestee node can call this from its ExecAsyncRequest
+ * or ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestPending(AsyncRequest *areq)
+{
+ areq->callback_pending = true;
+ areq->request_complete = false;
+ areq->result = NULL;
+}
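Each entry point in execAsync.c follows the same shape: a switch on the node tag that routes to the node-type implementation, with an error default for non-async nodes. A toy version of that dispatch pattern (hypothetical tags and handler; the real code switches on nodeTag() and calls elog(ERROR, ...)):

```c
#include <assert.h>

/* Toy tag-based dispatch in the style of execAsync.c (illustration
 * only; real code switches on nodeTag() and raises an error). */
typedef enum ToyNodeTag
{
	TOY_FOREIGN_SCAN,
	TOY_SEQ_SCAN
} ToyNodeTag;

static int	foreign_requests = 0;	/* counts dispatched requests */

static int
toy_async_request(ToyNodeTag tag)
{
	switch (tag)
	{
		case TOY_FOREIGN_SCAN:
			foreign_requests++;		/* ExecAsyncForeignScanRequest() */
			return 0;
		default:
			/* If the node doesn't support async, caller messed up. */
			return -1;
	}
}
```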
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 15e4115bd6..98346b3e30 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -57,10 +57,13 @@
#include "postgres.h"
+#include "executor/execAsync.h"
#include "executor/execdebug.h"
#include "executor/execPartition.h"
#include "executor/nodeAppend.h"
#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
/* Shared state for parallel-aware Append. */
struct ParallelAppendState
@@ -78,12 +81,18 @@ struct ParallelAppendState
};
#define INVALID_SUBPLAN_INDEX -1
+#define EVENT_BUFFER_SIZE 16
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
static bool choose_next_subplan_for_worker(AppendState *node);
static void mark_invalid_subplans_as_finished(AppendState *node);
+static void ExecAppendAsyncBegin(AppendState *node);
+static bool ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result);
+static bool ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result);
+static void ExecAppendAsyncEventWait(AppendState *node);
+static void classify_matching_subplans(AppendState *node);
/* ----------------------------------------------------------------
* ExecInitAppend
@@ -102,7 +111,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
AppendState *appendstate = makeNode(AppendState);
PlanState **appendplanstates;
Bitmapset *validsubplans;
+ Bitmapset *asyncplans;
int nplans;
+ int nasyncplans;
int firstvalid;
int i,
j;
@@ -119,6 +130,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+ appendstate->as_syncdone = false;
+ appendstate->as_begun = false;
/* If run-time partition pruning is enabled, then set that up now */
if (node->part_prune_info != NULL)
@@ -191,12 +204,25 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
* While at it, find out the first valid partial plan.
*/
j = 0;
+ asyncplans = NULL;
+ nasyncplans = 0;
firstvalid = nplans;
i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record async subplans. When executing EvalPlanQual, we execute
+ * async subplans synchronously; don't do this when initializing an
+ * EvalPlanQual plan tree.
+ */
+ if (initNode->async_capable && estate->es_epq_active == NULL)
+ {
+ asyncplans = bms_add_member(asyncplans, j);
+ nasyncplans++;
+ }
+
/*
* Record the lowest appendplans index which is a valid partial plan.
*/
@@ -210,6 +236,37 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ /* Initialize async state */
+ appendstate->as_asyncplans = asyncplans;
+ appendstate->as_nasyncplans = nasyncplans;
+ appendstate->as_asyncrequests = NULL;
+ appendstate->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ appendstate->as_needrequest = NULL;
+ appendstate->as_eventset = NULL;
+
+ if (nasyncplans > 0)
+ {
+ appendstate->as_asyncrequests = (AsyncRequest **)
+ palloc0(nplans * sizeof(AsyncRequest *));
+
+ i = -1;
+ while ((i = bms_next_member(asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq;
+
+ areq = palloc(sizeof(AsyncRequest));
+ areq->requestor = (PlanState *) appendstate;
+ areq->requestee = appendplanstates[i];
+ areq->request_index = i;
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+
+ appendstate->as_asyncrequests[i] = areq;
+ }
+ }
+
/*
* Miscellaneous initialization
*/
@@ -232,31 +289,59 @@ static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
AppendState *node = castNode(AppendState, pstate);
+ TupleTableSlot *result;
- if (node->as_whichplan < 0)
+ /*
+ * If this is the first call after Init or ReScan, we need to do the
+ * initialization work.
+ */
+ if (!node->as_begun)
{
+ Assert(node->as_whichplan == INVALID_SUBPLAN_INDEX);
+ Assert(!node->as_syncdone);
+
/* Nothing to do if there are no subplans */
if (node->as_nplans == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ /* If there are any async subplans, begin executing them. */
+ if (node->as_nasyncplans > 0)
+ ExecAppendAsyncBegin(node);
+
/*
- * If no subplan has been chosen, we must choose one before
+ * If no sync subplan has been chosen, we must choose one before
* proceeding.
*/
- if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
- !node->choose_next_subplan(node))
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+
+ Assert(node->as_syncdone ||
+ (node->as_whichplan >= 0 &&
+ node->as_whichplan < node->as_nplans));
+
+ /* And we're initialized. */
+ node->as_begun = true;
}
for (;;)
{
PlanState *subnode;
- TupleTableSlot *result;
CHECK_FOR_INTERRUPTS();
/*
- * figure out which subplan we are currently processing
+ * try to get a tuple from an async subplan if any
+ */
+ if (node->as_syncdone || !bms_is_empty(node->as_needrequest))
+ {
+ if (ExecAppendAsyncGetNext(node, &result))
+ return result;
+ Assert(!node->as_syncdone);
+ Assert(bms_is_empty(node->as_needrequest));
+ }
+
+ /*
+ * figure out which sync subplan we are currently processing
*/
Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
subnode = node->appendplans[node->as_whichplan];
@@ -276,8 +361,16 @@ ExecAppend(PlanState *pstate)
return result;
}
- /* choose new subplan; if none, we're done */
- if (!node->choose_next_subplan(node))
+ /*
+ * wait or poll async events if any. We do this before checking for
+ * the end of iteration, because it might drain the remaining async
+ * subplans.
+ */
+ if (node->as_nasyncremain > 0)
+ ExecAppendAsyncEventWait(node);
+
+ /* choose new sync subplan; if no sync/async subplans, we're done */
+ if (!node->choose_next_subplan(node) && node->as_nasyncremain == 0)
return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -313,6 +406,7 @@ ExecEndAppend(AppendState *node)
void
ExecReScanAppend(AppendState *node)
{
+ int nasyncplans = node->as_nasyncplans;
int i;
/*
@@ -326,6 +420,11 @@ ExecReScanAppend(AppendState *node)
{
bms_free(node->as_valid_subplans);
node->as_valid_subplans = NULL;
+ if (nasyncplans > 0)
+ {
+ bms_free(node->as_valid_asyncplans);
+ node->as_valid_asyncplans = NULL;
+ }
}
for (i = 0; i < node->as_nplans; i++)
@@ -347,8 +446,27 @@ ExecReScanAppend(AppendState *node)
ExecReScan(subnode);
}
+ /* Reset async state */
+ if (nasyncplans > 0)
+ {
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ areq->callback_pending = false;
+ areq->request_complete = false;
+ areq->result = NULL;
+ }
+
+ bms_free(node->as_needrequest);
+ node->as_needrequest = NULL;
+ }
+
/* Let choose_next_subplan_* function handle setting the first subplan */
node->as_whichplan = INVALID_SUBPLAN_INDEX;
+ node->as_syncdone = false;
+ node->as_begun = false;
}
/* ----------------------------------------------------------------
@@ -429,7 +547,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
/* ----------------------------------------------------------------
* choose_next_subplan_locally
*
- * Choose next subplan for a non-parallel-aware Append,
+ * Choose next sync subplan for a non-parallel-aware Append,
* returning false if there are no more.
* ----------------------------------------------------------------
*/
@@ -442,16 +560,25 @@ choose_next_subplan_locally(AppendState *node)
/* We should never be called when there are no subplans */
Assert(node->as_nplans > 0);
+ /* Nothing to do if syncdone */
+ if (node->as_syncdone)
+ return false;
+
/*
* If first call then have the bms member function choose the first valid
- * subplan by initializing whichplan to -1. If there happen to be no
- * valid subplans then the bms member function will handle that by
- * returning a negative number which will allow us to exit returning a
+ * sync subplan by initializing whichplan to -1. If there happen to be
+ * no valid sync subplans then the bms member function will handle that
+ * by returning a negative number which will allow us to exit returning a
* false value.
*/
if (whichplan == INVALID_SUBPLAN_INDEX)
{
- if (node->as_valid_subplans == NULL)
+ if (node->as_nasyncplans > 0)
+ {
+ /* We'd have filled as_valid_subplans already */
+ Assert(node->as_valid_subplans);
+ }
+ else if (node->as_valid_subplans == NULL)
node->as_valid_subplans =
ExecFindMatchingSubPlans(node->as_prune_state);
@@ -467,7 +594,12 @@ choose_next_subplan_locally(AppendState *node)
nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
if (nextplan < 0)
+ {
+ /* Set as_syncdone if in async mode */
+ if (node->as_nasyncplans > 0)
+ node->as_syncdone = true;
return false;
+ }
node->as_whichplan = nextplan;
@@ -709,3 +841,307 @@ mark_invalid_subplans_as_finished(AppendState *node)
node->as_pstate->pa_finished[i] = true;
}
}
+
+/* ----------------------------------------------------------------
+ * Asynchronous Append Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncBegin
+ *
+ * Begin executing designated async-capable subplans.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncBegin(AppendState *node)
+{
+ int i;
+
+ /* Backward scan is not supported by async-aware Appends. */
+ Assert(ScanDirectionIsForward(node->ps.state->es_direction));
+
+ /* We should never be called when there are no async subplans. */
+ Assert(node->as_nasyncplans > 0);
+
+ /* If we've yet to determine the valid subplans then do so now. */
+ if (node->as_valid_subplans == NULL)
+ node->as_valid_subplans =
+ ExecFindMatchingSubPlans(node->as_prune_state);
+
+ classify_matching_subplans(node);
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (node->as_nasyncremain == 0)
+ return;
+
+ /* Make a request for each of the valid async subplans. */
+ i = -1;
+ while ((i = bms_next_member(node->as_valid_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ Assert(areq->request_index == i);
+ Assert(!areq->callback_pending);
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncGetNext
+ *
+ * Get the next tuple from any of the asynchronous subplans.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
+{
+ *result = NULL;
+
+ /* We should never be called when there are no valid async subplans. */
+ Assert(node->as_nasyncremain > 0);
+
+ /* Request a tuple asynchronously. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ while (node->as_nasyncremain > 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Wait or poll async events. */
+ ExecAppendAsyncEventWait(node);
+
+ /* Request a tuple asynchronously. */
+ if (ExecAppendAsyncRequest(node, result))
+ return true;
+
+ /* Break from loop if there's any sync subplan that isn't complete. */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If all sync subplans are complete, we're totally done scanning the
+ * given node. Otherwise, we're done with the asynchronous stuff but
+ * must continue scanning the sync subplans.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncremain == 0);
+ *result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncRequest
+ *
+ * If there are any asynchronous subplans that need a new
+ * request, make new requests for all of them.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
+{
+ Bitmapset *needrequest;
+ int i;
+
+ /* Nothing to do if there are no async subplans needing a new request. */
+ if (bms_is_empty(node->as_needrequest))
+ return false;
+
+ /*
+ * If there are any asynchronously-generated results that have not yet
+ * been returned, we have nothing to do; just return one of them.
+ */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ /* Make a new request for each of the async subplans that need it. */
+ needrequest = node->as_needrequest;
+ node->as_needrequest = NULL;
+ i = -1;
+ while ((i = bms_next_member(needrequest, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ /* Do the actual work. */
+ ExecAsyncRequest(areq);
+ }
+ bms_free(needrequest);
+
+ /* Return one of the asynchronously-generated results if any. */
+ if (node->as_nasyncresults > 0)
+ {
+ --node->as_nasyncresults;
+ *result = node->as_asyncresults[node->as_nasyncresults];
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------
+ * ExecAppendAsyncEventWait
+ *
+ * Wait or poll for file descriptor events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecAppendAsyncEventWait(AppendState *node)
+{
+ long timeout = node->as_syncdone ? -1 : 0;
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+
+ /* We should never be called when there are no valid async subplans. */
+ Assert(node->as_nasyncremain > 0);
+
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
+ node->as_nasyncplans + 1);
+ AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+ NULL, NULL);
+
+ /* Give each waiting subplan a chance to add an event. */
+ i = -1;
+ while ((i = bms_next_member(node->as_asyncplans, i)) >= 0)
+ {
+ AsyncRequest *areq = node->as_asyncrequests[i];
+
+ if (areq->callback_pending)
+ ExecAsyncConfigureWait(areq);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+ EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ FreeWaitEventSet(node->as_eventset);
+ node->as_eventset = NULL;
+ if (noccurred == 0)
+ return;
+
+ /* Deliver notifications. */
+ for (i = 0; i < noccurred; i++)
+ {
+ WaitEvent *w = &occurred_event[i];
+
+ /*
+ * Each waiting subplan should have registered its wait event with
+ * user_data pointing back to its AsyncRequest.
+ */
+ if ((w->events & WL_SOCKET_READABLE) != 0)
+ {
+ AsyncRequest *areq = (AsyncRequest *) w->user_data;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ Assert(areq->callback_pending);
+ areq->callback_pending = false;
+
+ /* Do the actual work. */
+ ExecAsyncNotify(areq);
+ }
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(AsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot = areq->result;
+
+ /* The result should be a TupleTableSlot or NULL. */
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Nothing to do if the request is pending. */
+ if (!areq->request_complete)
+ {
+ /* The request would have been pending for a callback */
+ Assert(areq->callback_pending);
+ return;
+ }
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ {
+ /* The ending subplan wouldn't have been pending for a callback. */
+ Assert(!areq->callback_pending);
+ --node->as_nasyncremain;
+ return;
+ }
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresults < node->as_nasyncplans);
+ node->as_asyncresults[node->as_nasyncresults++] = slot;
+
+ /*
+ * Mark the subplan that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
+
+/* ----------------------------------------------------------------
+ * classify_matching_subplans
+ *
+ * Classify the node's as_valid_subplans into sync ones and
+ * async ones, adjust it to contain sync ones only, and save
+ * async ones in the node's as_valid_asyncplans.
+ * ----------------------------------------------------------------
+ */
+static void
+classify_matching_subplans(AppendState *node)
+{
+ Bitmapset *valid_asyncplans;
+
+ Assert(node->as_valid_asyncplans == NULL);
+
+ /* Nothing to do if there are no valid subplans. */
+ if (bms_is_empty(node->as_valid_subplans))
+ {
+ node->as_syncdone = true;
+ node->as_nasyncremain = 0;
+ return;
+ }
+
+ /* Nothing to do if there are no valid async subplans. */
+ if (!bms_overlap(node->as_valid_subplans, node->as_asyncplans))
+ {
+ node->as_nasyncremain = 0;
+ return;
+ }
+
+ /* Get valid async subplans. */
+ valid_asyncplans = bms_copy(node->as_asyncplans);
+ valid_asyncplans = bms_int_members(valid_asyncplans,
+ node->as_valid_subplans);
+
+ /* Adjust the valid subplans to contain sync subplans only. */
+ node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
+ valid_asyncplans);
+ node->as_syncdone = bms_is_empty(node->as_valid_subplans);
+
+ /* Save valid async subplans. */
+ node->as_valid_asyncplans = valid_asyncplans;
+ node->as_nasyncremain = bms_num_members(valid_asyncplans);
+}
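The set arithmetic in classify_matching_subplans() can be illustrated with plain word-sized masks standing in for PostgreSQL's Bitmapset (a sketch only, limited to 32 subplans; the real code uses bms_int_members and bms_del_members):

```c
#include <assert.h>

/* Sketch of classify_matching_subplans() with one unsigned mask per
 * set instead of a Bitmapset (illustration only). */
typedef struct ToyClassify
{
	unsigned	valid_sync;		/* as_valid_subplans, after adjustment */
	unsigned	valid_async;	/* as_valid_asyncplans */
	int			nasyncremain;	/* as_nasyncremain */
} ToyClassify;

static int
popcount32(unsigned x)
{
	int			n = 0;

	for (; x != 0; x &= x - 1)
		n++;
	return n;
}

static ToyClassify
toy_classify(unsigned valid, unsigned async)
{
	ToyClassify r;

	r.valid_async = valid & async;	/* async plans that survived pruning */
	r.valid_sync = valid & ~async;	/* the rest are run synchronously */
	r.nasyncremain = popcount32(r.valid_async);
	return r;
}
```

For example, if pruning leaves subplans 0-3 valid and subplans 1 and 3 are async-capable, subplans 1 and 3 go to the async set and 0 and 2 remain in the sync set.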
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0969e53c3a..898890fb08 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -391,3 +391,51 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Asynchronously request a tuple from a designated async-capable node
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Callback invoked when a relevant event has occurred
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(AsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 1d0bb6e2e7..d58b79d525 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -120,6 +120,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
COPY_SCALAR_FIELD(plan_width);
COPY_SCALAR_FIELD(parallel_aware);
COPY_SCALAR_FIELD(parallel_safe);
+ COPY_SCALAR_FIELD(async_capable);
COPY_SCALAR_FIELD(plan_node_id);
COPY_NODE_FIELD(targetlist);
COPY_NODE_FIELD(qual);
@@ -241,6 +242,7 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 301fa30490..ff127a19ad 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -333,6 +333,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
WRITE_INT_FIELD(plan_width);
WRITE_BOOL_FIELD(parallel_aware);
WRITE_BOOL_FIELD(parallel_safe);
+ WRITE_BOOL_FIELD(async_capable);
WRITE_INT_FIELD(plan_node_id);
WRITE_NODE_FIELD(targetlist);
WRITE_NODE_FIELD(qual);
@@ -431,6 +432,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 377185f7c6..6a563e9903 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1615,6 +1615,7 @@ ReadCommonPlan(Plan *local_node)
READ_INT_FIELD(plan_width);
READ_BOOL_FIELD(parallel_aware);
READ_BOOL_FIELD(parallel_safe);
+ READ_BOOL_FIELD(async_capable);
READ_INT_FIELD(plan_node_id);
READ_NODE_FIELD(targetlist);
READ_NODE_FIELD(qual);
@@ -1711,6 +1712,7 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b92c948588..0c016a03dd 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -147,6 +147,7 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
+bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 906cab7053..78ef068fb7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,6 +81,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
+static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1080,6 +1081,31 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ break;
+ default:
+ break;
+ }
+ return false;
+}
+
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1097,6 +1123,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1104,6 +1131,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
+ bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1167,6 +1195,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
+ /* If appropriate, consider async append */
+ consider_async = (enable_async_append && pathkeys == NIL &&
+ !best_path->path.parallel_safe &&
+ list_length(best_path->subpaths) > 1);
+
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1234,6 +1267,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (consider_async && is_async_capable_path(subpath))
+ {
+ subplan->async_capable = true;
+ ++nasyncplans;
+ }
}
/*
@@ -1266,6 +1306,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
+ plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 60f45ccc4e..4b9bcd2b41 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3995,6 +3995,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
switch (w)
{
+ case WAIT_EVENT_APPEND_READY:
+ event_name = "AppendReady";
+ break;
case WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE:
event_name = "BackupWaitWalArchive";
break;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 43a5fded10..5f3318fa8f 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -2020,6 +2020,15 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
}
#endif
+/*
+ * Get the number of wait events registered in a given WaitEventSet.
+ */
+int
+GetNumRegisteredWaitEvents(WaitEventSet *set)
+{
+ return set->nevents;
+}
+
#if defined(WAIT_USE_POLL)
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0c5dc4d3e8..03daec9a08 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1128,6 +1128,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of async append plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_async_append,
+ true,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b234a6bfe6..791d39cf07 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -371,6 +371,7 @@
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on
+#enable_async_append = on
# - Planner Cost Constants -
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..724034f226
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous execution
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(AsyncRequest *areq);
+extern void ExecAsyncConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncNotify(AsyncRequest *areq);
+extern void ExecAsyncResponse(AsyncRequest *areq);
+extern void ExecAsyncRequestDone(AsyncRequest *areq, TupleTableSlot *result);
+extern void ExecAsyncRequestPending(AsyncRequest *areq);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index cafd410a5d..fa54ac6ad2 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -25,4 +25,6 @@ extern void ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt);
extern void ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt);
+extern void ExecAsyncAppendResponse(AsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 6ae7733e25..8ffc0ca5bf 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,4 +31,8 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
ParallelWorkerContext *pwcxt);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(AsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(AsyncRequest *areq);
+extern void ExecAsyncForeignScanNotify(AsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 248f78da45..7c89d081c7 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -178,6 +178,14 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
List *fdw_private,
RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+
+typedef void (*ForeignAsyncRequest_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncConfigureWait_function) (AsyncRequest *areq);
+
+typedef void (*ForeignAsyncNotify_function) (AsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -256,6 +264,12 @@ typedef struct FdwRoutine
/* Support functions for path reparameterization. */
ReparameterizeForeignPathByChild_function ReparameterizeForeignPathByChild;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e31ad6204e..43e7f62489 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -515,6 +515,22 @@ typedef struct ResultRelInfo
struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
} ResultRelInfo;
+/* ----------------
+ * AsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct AsyncRequest
+{
+ struct PlanState *requestor; /* Node that wants a tuple */
+ struct PlanState *requestee; /* Node from which a tuple is wanted */
+ int request_index; /* Scratch space for requestor */
+ bool callback_pending; /* Callback is needed */
+ bool request_complete; /* Request complete, result valid */
+ TupleTableSlot *result; /* Result (NULL if no more tuples) */
+} AsyncRequest;
+
/* ----------------
* EState information
*
@@ -1199,12 +1215,12 @@ typedef struct ModifyTableState
* AppendState information
*
* nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1), or a
- * special negative value. See nodeAppend.c.
+ * whichplan which synchronous plan is being executed (0 .. n-1)
+ * or a special negative value. See nodeAppend.c.
* prune_state details required to allow partitions to be
* eliminated from the scan, or NULL if not possible.
- * valid_subplans for runtime pruning, valid appendplans indexes to
- * scan.
+ * valid_subplans for runtime pruning, valid synchronous appendplans
+ * indexes to scan.
* ----------------
*/
@@ -1220,12 +1236,25 @@ struct AppendState
PlanState **appendplans; /* array of PlanStates for my inputs */
int as_nplans;
int as_whichplan;
+ bool as_begun; /* false means need to initialize */
+ Bitmapset *as_asyncplans; /* asynchronous plans indexes */
+ int as_nasyncplans; /* # of asynchronous plans */
+ AsyncRequest **as_asyncrequests; /* array of AsyncRequests */
+ TupleTableSlot **as_asyncresults; /* unreturned results of async plans */
+ int as_nasyncresults; /* # of valid entries in as_asyncresults */
+ bool as_syncdone; /* true if all synchronous plans done in
+ * asynchronous mode, else false */
+ int as_nasyncremain; /* # of remaining async plans */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ struct WaitEventSet *as_eventset; /* WaitEventSet used to configure
+ * file descriptor wait events */
int as_first_partial_plan; /* Index of 'appendplans' containing
* the first partial plan */
ParallelAppendState *as_pstate; /* parallel coordination info */
Size pstate_len; /* size of parallel coordination info */
struct PartitionPruneState *as_prune_state;
Bitmapset *as_valid_subplans;
+ Bitmapset *as_valid_asyncplans; /* valid asynchronous plans indexes */
bool (*choose_next_subplan) (AppendState *);
};
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6e62104d0b..24ca616740 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -129,6 +129,11 @@ typedef struct Plan
bool parallel_aware; /* engage parallel-aware logic? */
bool parallel_safe; /* OK to use as part of parallel plan? */
+ /*
+ * information needed for asynchronous execution
+ */
+ bool async_capable; /* engage asynchronous-capable logic? */
+
/*
* Common structural data for all Plan types.
*/
@@ -245,6 +250,7 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
+ int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 1be93be098..a3fd93fe07 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -65,6 +65,7 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
+extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 87672e6f30..d699502cd9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -966,7 +966,8 @@ typedef enum
*/
typedef enum
{
- WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE = PG_WAIT_IPC,
+ WAIT_EVENT_APPEND_READY = PG_WAIT_IPC,
+ WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE,
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
WAIT_EVENT_BTREE_PAGE,
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 9e94fcaec2..44f9368c64 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -179,5 +179,6 @@ extern int WaitLatch(Latch *latch, int wakeEvents, long timeout,
extern int WaitLatchOrSocket(Latch *latch, int wakeEvents,
pgsocket sock, long timeout, uint32 wait_event_info);
extern void InitializeLatchWaitSet(void);
+extern int GetNumRegisteredWaitEvents(WaitEventSet *set);
#endif /* LATCH_H */
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 791eba8511..b89b99fb02 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -87,6 +87,7 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -136,6 +137,7 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Plan> +
<Node-Type>Seq Scan</Node-Type> +
<Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
<Relation-Name>int8_tbl</Relation-Name> +
<Alias>i8</Alias> +
<Startup-Cost>N.N</Startup-Cost> +
@@ -183,6 +185,7 @@ select explain_filter('explain (analyze, buffers, format yaml) select * from int
- Plan: +
Node Type: "Seq Scan" +
Parallel Aware: false +
+ Async Capable: false +
Relation Name: "int8_tbl"+
Alias: "i8" +
Startup Cost: N.N +
@@ -233,6 +236,7 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Plan": { +
"Node Type": "Seq Scan", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "int8_tbl",+
"Alias": "i8", +
"Startup Cost": N.N, +
@@ -346,6 +350,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Relation Name": "tenk1", +
"Parallel Aware": true, +
"Local Hit Blocks": 0, +
@@ -391,6 +396,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Sort Space Used": 0, +
"Local Hit Blocks": 0, +
@@ -433,6 +439,7 @@ select jsonb_pretty(
"Actual Rows": 0, +
"Actual Loops": 0, +
"Startup Cost": 0.0, +
+ "Async Capable": false, +
"Parallel Aware": false, +
"Workers Planned": 0, +
"Local Hit Blocks": 0, +
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 68ca321163..a417b566d9 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -558,6 +558,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 55, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
@@ -760,6 +761,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
"Node Type": "Incremental Sort", +
"Actual Rows": 70, +
"Actual Loops": 1, +
+ "Async Capable": false, +
"Presorted Key": [ +
"t.a" +
], +
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index ff157ceb1c..499245068a 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -204,6 +204,7 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
+ "Async Capable": false, +
"Relation Name": "insertconflicttest", +
"Alias": "insertconflicttest", +
"Conflict Resolution": "UPDATE", +
@@ -213,7 +214,8 @@ explain (costs off, format json) insert into insertconflicttest values (0, 'Bilb
{ +
"Node Type": "Result", +
"Parent Relationship": "Member", +
- "Parallel Aware": false +
+ "Parallel Aware": false, +
+ "Async Capable": false +
} +
] +
} +
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 6d048e309c..98dde452e6 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -95,6 +95,7 @@ select count(*) = 0 as ok from pg_stat_wal_receiver;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
enable_hashagg | on
@@ -113,7 +114,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--
2.19.2
At Tue, 30 Mar 2021 20:40:35 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Mon, Mar 29, 2021 at 6:50 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I think the patch would be committable.
Here is a new version of the patch.
* Rebased the patch against HEAD.
* Tweaked docs/comments a bit further.
* Added the commit message. Does that make sense?

I'm happy with the patch, so I'll commit it if there are no objections.
Thanks for the patch.
May I ask some questions?
+ <term><literal>async_capable</literal></term>
+ <listitem>
+ <para>
+ This option controls whether <filename>postgres_fdw</filename> allows
+ foreign tables to be scanned concurrently for asynchronous execution.
+ It can be specified for a foreign table or a foreign server.
Isn't it strange that an option named "async_capable" *allows* async?
+ * We'll prefer to consider this join async-capable if any table from
+ * either side of the join is considered async-capable.
+ */
+ fpinfo->async_capable = fpinfo_o->async_capable ||
+ fpinfo_i->async_capable;
We need to explain this behavior in the documentation.
Regarding the wording "async capable": if it literally represents
the capability to run asynchronously, then when any one element of a
combined path lacks the capability, the whole path cannot be
async-capable. If it represents allowance for an element to run
asynchronously, then the whole path is inhibited from running
asynchronously unless all elements are allowed to do so. If it
represents enforcement or suggestion to run asynchronously, enforcing
asynchrony on an element would lead to running the whole path
asynchronously, since all elements of postgres_fdw are inherently
capable of running asynchronously.
Doesn't it look somewhat inconsistent that the default value of
"async_capable" is restrictive, while the merging is aggressive?
If I'm wrong in the understanding, please feel free to go ahead.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Mar 31, 2021 at 10:11 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
+ <term><literal>async_capable</literal></term>
+ <listitem>
+ <para>
+ This option controls whether <filename>postgres_fdw</filename> allows
+ foreign tables to be scanned concurrently for asynchronous execution.
+ It can be specified for a foreign table or a foreign server.

Isn't it strange that an option named "async_capable" *allows* async?
I think "async_capable" is a good name for that option. See the
option "updatable" below in the postgres_fdw documentation.
+ * We'll prefer to consider this join async-capable if any table from
+ * either side of the join is considered async-capable.
+ */
+ fpinfo->async_capable = fpinfo_o->async_capable ||
+ fpinfo_i->async_capable;

We need to explain this behavior in the documentation.
Regarding to the wording "async capable", if it literally represents
the capability to run asynchronously, when any one element of a
combined path doesn't have the capability, the whole path cannot be
async-capable. If it represents allowance for an element to run
asynchronously, then the whole path is inhibited to run asynchronously
unless all elements are allowed to do so. If it represents
enforcement or suggestion to run asynchronously, enforcing asynchrony
to an element would lead to running the whole path asynchronously
since all elements of postgres_fdw are capable to run asynchronously
as the nature.

It looks somewhat inconsistent to be inhibitive for the default value
of "async_capable", but aggressive in merging?
If the foreign table has async_capable=true, it actually means that
there are resources (CPU, IO, network, etc.) to scan the foreign table
concurrently. And if any table from either side of the join has such
resources, then they could also be used for the join. So I don't
think this behavior is aggressive. I think it would be better to add
more comments, though.
Anyway, these are all about naming and docs/comments, so I'll return
to this after committing the patch.
Thanks for the review!
Best regards,
Etsuro Fujita
On Tue, Mar 30, 2021 at 8:40 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I'm happy with the patch, so I'll commit it if there are no objections.
Pushed.
Best regards,
Etsuro Fujita
Etsuro Fujita <etsuro.fujita@gmail.com> writes:
Pushed.
The buildfarm points out that this fails under valgrind.
I easily reproduced it here:
==00:00:03:42.115 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
==00:00:03:42.115 3410499== at 0x58E926B: epoll_wait (epoll_wait.c:30)
==00:00:03:42.115 3410499== by 0x7FC903: WaitEventSetWaitBlock (latch.c:1452)
==00:00:03:42.115 3410499== by 0x7FC903: WaitEventSetWait (latch.c:1398)
==00:00:03:42.115 3410499== by 0x6BF46C: ExecAppendAsyncEventWait (nodeAppend.c:1025)
==00:00:03:42.115 3410499== by 0x6BF667: ExecAppendAsyncGetNext (nodeAppend.c:915)
==00:00:03:42.115 3410499== by 0x6BF667: ExecAppend (nodeAppend.c:337)
==00:00:03:42.115 3410499== by 0x6D49E4: ExecProcNode (executor.h:257)
==00:00:03:42.115 3410499== by 0x6D49E4: ExecModifyTable (nodeModifyTable.c:2222)
==00:00:03:42.115 3410499== by 0x6A87F2: ExecProcNode (executor.h:257)
==00:00:03:42.115 3410499== by 0x6A87F2: ExecutePlan (execMain.c:1531)
==00:00:03:42.115 3410499== by 0x6A87F2: standard_ExecutorRun (execMain.c:350)
==00:00:03:42.115 3410499== by 0x82597F: ProcessQuery (pquery.c:160)
==00:00:03:42.115 3410499== by 0x825BE9: PortalRunMulti (pquery.c:1267)
==00:00:03:42.115 3410499== by 0x826826: PortalRun (pquery.c:779)
==00:00:03:42.115 3410499== by 0x82291E: exec_simple_query (postgres.c:1185)
==00:00:03:42.115 3410499== by 0x823F3E: PostgresMain (postgres.c:4415)
==00:00:03:42.115 3410499== by 0x79BAC1: BackendRun (postmaster.c:4483)
==00:00:03:42.115 3410499== by 0x79BAC1: BackendStartup (postmaster.c:4205)
==00:00:03:42.115 3410499== by 0x79BAC1: ServerLoop (postmaster.c:1737)
==00:00:03:42.115 3410499== Address 0x10d10628 is 7,960 bytes inside a recently re-allocated block of size 8,192 alloc'd
==00:00:03:42.115 3410499== at 0x4C30F0B: malloc (vg_replace_malloc.c:307)
==00:00:03:42.115 3410499== by 0x94F9EA: AllocSetAlloc (aset.c:919)
==00:00:03:42.115 3410499== by 0x957BAF: MemoryContextAlloc (mcxt.c:809)
==00:00:03:42.115 3410499== by 0x958CC0: MemoryContextStrdup (mcxt.c:1179)
==00:00:03:42.115 3410499== by 0x516AE4: untransformRelOptions (reloptions.c:1336)
==00:00:03:42.115 3410499== by 0x6E6ADF: GetForeignTable (foreign.c:273)
==00:00:03:42.115 3410499== by 0xF3BD470: postgresBeginForeignScan (postgres_fdw.c:1479)
==00:00:03:42.115 3410499== by 0x6C2E83: ExecInitForeignScan (nodeForeignscan.c:236)
==00:00:03:42.115 3410499== by 0x6AF893: ExecInitNode (execProcnode.c:283)
==00:00:03:42.115 3410499== by 0x6C0007: ExecInitAppend (nodeAppend.c:232)
==00:00:03:42.115 3410499== by 0x6AFA37: ExecInitNode (execProcnode.c:180)
==00:00:03:42.115 3410499== by 0x6D533A: ExecInitModifyTable (nodeModifyTable.c:2575)
==00:00:03:44.907 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
==00:00:03:44.907 3410499== at 0x58E926B: epoll_wait (epoll_wait.c:30)
==00:00:03:44.907 3410499== by 0x7FC903: WaitEventSetWaitBlock (latch.c:1452)
==00:00:03:44.907 3410499== by 0x7FC903: WaitEventSetWait (latch.c:1398)
==00:00:03:44.907 3410499== by 0x6BF46C: ExecAppendAsyncEventWait (nodeAppend.c:1025)
==00:00:03:44.907 3410499== by 0x6BF718: ExecAppend (nodeAppend.c:370)
==00:00:03:44.907 3410499== by 0x6D49E4: ExecProcNode (executor.h:257)
==00:00:03:44.907 3410499== by 0x6D49E4: ExecModifyTable (nodeModifyTable.c:2222)
==00:00:03:44.907 3410499== by 0x6A87F2: ExecProcNode (executor.h:257)
==00:00:03:44.907 3410499== by 0x6A87F2: ExecutePlan (execMain.c:1531)
==00:00:03:44.907 3410499== by 0x6A87F2: standard_ExecutorRun (execMain.c:350)
==00:00:03:44.907 3410499== by 0x82597F: ProcessQuery (pquery.c:160)
==00:00:03:44.907 3410499== by 0x825BE9: PortalRunMulti (pquery.c:1267)
==00:00:03:44.907 3410499== by 0x826826: PortalRun (pquery.c:779)
==00:00:03:44.907 3410499== by 0x82291E: exec_simple_query (postgres.c:1185)
==00:00:03:44.907 3410499== by 0x823F3E: PostgresMain (postgres.c:4415)
==00:00:03:44.907 3410499== by 0x79BAC1: BackendRun (postmaster.c:4483)
==00:00:03:44.907 3410499== by 0x79BAC1: BackendStartup (postmaster.c:4205)
==00:00:03:44.907 3410499== by 0x79BAC1: ServerLoop (postmaster.c:1737)
==00:00:03:44.907 3410499== Address 0x1093fdd8 is 2,904 bytes inside a recently re-allocated block of size 16,384 alloc'd
==00:00:03:44.907 3410499== at 0x4C30F0B: malloc (vg_replace_malloc.c:307)
==00:00:03:44.907 3410499== by 0x94F9EA: AllocSetAlloc (aset.c:919)
==00:00:03:44.907 3410499== by 0x958233: palloc (mcxt.c:964)
==00:00:03:44.907 3410499== by 0x69C400: ExprEvalPushStep (execExpr.c:2310)
==00:00:03:44.907 3410499== by 0x69C541: ExecPushExprSlots (execExpr.c:2490)
==00:00:03:44.907 3410499== by 0x69C580: ExecInitExprSlots (execExpr.c:2445)
==00:00:03:44.907 3410499== by 0x69F0DD: ExecInitQual (execExpr.c:231)
==00:00:03:44.907 3410499== by 0x6D80EF: ExecInitSeqScan (nodeSeqscan.c:172)
==00:00:03:44.907 3410499== by 0x6AF9CE: ExecInitNode (execProcnode.c:208)
==00:00:03:44.907 3410499== by 0x6C0007: ExecInitAppend (nodeAppend.c:232)
==00:00:03:44.907 3410499== by 0x6AFA37: ExecInitNode (execProcnode.c:180)
==00:00:03:44.907 3410499== by 0x6D533A: ExecInitModifyTable (nodeModifyTable.c:2575)
==00:00:03:44.907 3410499==
Sorta looks like something is relying on a pointer into the relcache
to be valid for longer than it can safely rely on that. The
CLOBBER_CACHE_ALWAYS animals will probably be unhappy too, but
they are slower than valgrind.
(Note that the test case appears to succeed, you have to notice that
the backend crashed after exiting.)
regards, tom lane
On Fri, Apr 2, 2021 at 12:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The buildfarm points out that this fails under valgrind.
I easily reproduced it here:
Sorta looks like something is relying on a pointer into the relcache
to be valid for longer than it can safely rely on that. The
CLOBBER_CACHE_ALWAYS animals will probably be unhappy too, but
they are slower than valgrind.

(Note that the test case appears to succeed, you have to notice that
the backend crashed after exiting.)
Will look into this.
Best regards,
Etsuro Fujita
On Fri, Apr 2, 2021 at 12:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The buildfarm points out that this fails under valgrind.
I easily reproduced it here:
==00:00:03:42.115 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
==00:00:03:42.115 3410499== at 0x58E926B: epoll_wait (epoll_wait.c:30)
==00:00:03:42.115 3410499== by 0x7FC903: WaitEventSetWaitBlock (latch.c:1452)
==00:00:03:42.115 3410499== by 0x7FC903: WaitEventSetWait (latch.c:1398)
==00:00:03:42.115 3410499== by 0x6BF46C: ExecAppendAsyncEventWait (nodeAppend.c:1025)
==00:00:03:42.115 3410499== by 0x6BF667: ExecAppendAsyncGetNext (nodeAppend.c:915)
==00:00:03:42.115 3410499== by 0x6BF667: ExecAppend (nodeAppend.c:337)
==00:00:03:42.115 3410499== by 0x6D49E4: ExecProcNode (executor.h:257)
==00:00:03:42.115 3410499== by 0x6D49E4: ExecModifyTable (nodeModifyTable.c:2222)
==00:00:03:42.115 3410499== by 0x6A87F2: ExecProcNode (executor.h:257)
==00:00:03:42.115 3410499== by 0x6A87F2: ExecutePlan (execMain.c:1531)
==00:00:03:42.115 3410499== by 0x6A87F2: standard_ExecutorRun (execMain.c:350)
==00:00:03:42.115 3410499== by 0x82597F: ProcessQuery (pquery.c:160)
==00:00:03:42.115 3410499== by 0x825BE9: PortalRunMulti (pquery.c:1267)
==00:00:03:42.115 3410499== by 0x826826: PortalRun (pquery.c:779)
==00:00:03:42.115 3410499== by 0x82291E: exec_simple_query (postgres.c:1185)
==00:00:03:42.115 3410499== by 0x823F3E: PostgresMain (postgres.c:4415)
==00:00:03:42.115 3410499== by 0x79BAC1: BackendRun (postmaster.c:4483)
==00:00:03:42.115 3410499== by 0x79BAC1: BackendStartup (postmaster.c:4205)
==00:00:03:42.115 3410499== by 0x79BAC1: ServerLoop (postmaster.c:1737)
==00:00:03:42.115 3410499== Address 0x10d10628 is 7,960 bytes inside a recently re-allocated block of size 8,192 alloc'd
==00:00:03:42.115 3410499== at 0x4C30F0B: malloc (vg_replace_malloc.c:307)
==00:00:03:42.115 3410499== by 0x94F9EA: AllocSetAlloc (aset.c:919)
==00:00:03:42.115 3410499== by 0x957BAF: MemoryContextAlloc (mcxt.c:809)
==00:00:03:42.115 3410499== by 0x958CC0: MemoryContextStrdup (mcxt.c:1179)
==00:00:03:42.115 3410499== by 0x516AE4: untransformRelOptions (reloptions.c:1336)
==00:00:03:42.115 3410499== by 0x6E6ADF: GetForeignTable (foreign.c:273)
==00:00:03:42.115 3410499== by 0xF3BD470: postgresBeginForeignScan (postgres_fdw.c:1479)
==00:00:03:42.115 3410499== by 0x6C2E83: ExecInitForeignScan (nodeForeignscan.c:236)
==00:00:03:42.115 3410499== by 0x6AF893: ExecInitNode (execProcnode.c:283)
==00:00:03:42.115 3410499== by 0x6C0007: ExecInitAppend (nodeAppend.c:232)
==00:00:03:42.115 3410499== by 0x6AFA37: ExecInitNode (execProcnode.c:180)
==00:00:03:42.115 3410499== by 0x6D533A: ExecInitModifyTable (nodeModifyTable.c:2575)
The reason for this would be that epoll_wait() is called with
maxevents exceeding the size of the input event array in the test
case. To fix, I adjusted the parameters used to call WaitEventSetWait()
in ExecAppendAsyncEventWait(). Patch attached.
Best regards,
Etsuro Fujita
Attachments:
fix-ExecAppendAsyncEventWait-2021-04-05.patch (application/octet-stream)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 7da8ffe065..c252757268 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -1001,6 +1001,7 @@ ExecAppendAsyncEventWait(AppendState *node)
long timeout = node->as_syncdone ? -1 : 0;
WaitEvent occurred_event[EVENT_BUFFER_SIZE];
int noccurred;
+ int nevents;
int i;
/* We should never be called when there are no valid async subplans. */
@@ -1022,8 +1023,9 @@ ExecAppendAsyncEventWait(AppendState *node)
}
/* Wait for at least one event to occur. */
+ nevents = Min(node->as_nasyncplans + 1, EVENT_BUFFER_SIZE);
noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
- EVENT_BUFFER_SIZE, WAIT_EVENT_APPEND_READY);
+ nevents, WAIT_EVENT_APPEND_READY);
FreeWaitEventSet(node->as_eventset);
node->as_eventset = NULL;
if (noccurred == 0)
Thanks for the patch.
At Mon, 5 Apr 2021 17:15:47 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Fri, Apr 2, 2021 at 12:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The buildfarm points out that this fails under valgrind.
I easily reproduced it here:
==00:00:03:42.115 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
==00:00:03:42.115 3410499== at 0x58E926B: epoll_wait (epoll_wait.c:30)
==00:00:03:42.115 3410499== by 0x7FC903: WaitEventSetWaitBlock (latch.c:1452)
...
The reason for this would be that epoll_wait() is called with
maxevents exceeding the size of the input event array in the test
case. To fix, I adjusted the parameters to call the caller function
# s/input/output/ event array? (occurred_event)
# I couldn't reproduce it, so sorry in advance if the following
# discussion is totally bogus..
I have nothing to say if it actually corrects the error, but the only
restriction on maxevents is that it must be positive, and in any case
epoll_wait returns no more than set->nevents events. So I'm a bit
doubtful that that is really the cause. In the first place, I'm
wondering whether valgrind is even aware of that depth..
==00:00:03:42.115 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
==00:00:03:42.115 3410499== at 0x58E926B: epoll_wait (epoll_wait.c:30)
...
==00:00:03:42.115 3410499== Address 0x10d10628 is 7,960 bytes inside a recently re-allocated block of size 8,192 alloc'd
==00:00:03:42.115 3410499== at 0x4C30F0B: malloc (vg_replace_malloc.c:307)
==00:00:03:42.115 3410499== by 0x94F9EA: AllocSetAlloc (aset.c:919)
==00:00:03:42.115 3410499== by 0x957BAF: MemoryContextAlloc (mcxt.c:809)
==00:00:03:42.115 3410499== by 0x958CC0: MemoryContextStrdup (mcxt.c:1179)
==00:00:03:42.115 3410499== by 0x516AE4: untransformRelOptions (reloptions.c:1336)
==00:00:03:42.115 3410499== by 0x6E6ADF: GetForeignTable (foreign.c:273)
==00:00:03:42.115 3410499== by 0xF3BD470: postgresBeginForeignScan (postgres_fdw.c:1479)
As Tom said, this looks like set->epoll_ret_events at that point
pointing to palloc'ed memory residing within a realloc'ed chunk.
Valgrind is saying that the variable (WaitEventSet *) set itself is a
valid pointer. On the other hand, set->epoll_ret_events points to a
memory chunk that valgrind may think has been freed. Since they are in
the same allocation block, the pointer alone would also be broken if
valgrind is right in its complaint.
I'm at a loss. How did you cause the error?
WaitEventSetWait() with in ExecAppendAsyncEventWait(). Patch
attached.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Apr 6, 2021 at 12:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Mon, 5 Apr 2021 17:15:47 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Fri, Apr 2, 2021 at 12:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The buildfarm points out that this fails under valgrind.
I easily reproduced it here:
==00:00:03:42.115 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
==00:00:03:42.115 3410499== at 0x58E926B: epoll_wait (epoll_wait.c:30)
==00:00:03:42.115 3410499== by 0x7FC903: WaitEventSetWaitBlock (latch.c:1452)...
The reason for this would be that epoll_wait() is called with
maxevents exceeding the size of the input event array in the test
case. To fix, I adjusted the parameters to call the caller function
# s/input/output/ event array? (occurred_event)
Sorry, my explanation was not enough; I think I was in a hurry. By
"the input event array" I meant the epoll_event array given to
epoll_wait() (i.e., the epoll_ret_events array).
# I couldn't reproduce it, so sorry in advance if the following
# discussion is totally bogus..
I produced this failure by running the following simple query in async
mode on a valgrind-enabled build:
select * from ft1 union all select * from ft2
where ft1 and ft2 are postgres_fdw foreign tables. For this query, we
would call WaitEventSetWait() with nevents=16 in
ExecAppendAsyncEventWait() as EVENT_BUFFER_SIZE=16, and then
epoll_wait() with maxevents=16 in WaitEventSetWaitBlock(); but
maxevents would exceed the size of the input event array, which is
three. I think this inconsistency caused the valgrind failure.
I'm not 100% sure about that, but the patch I posted fixing this
inconsistency made the failure go away in my environment.
I have nothing to say if it actually corrects the error, but the only
restriction of maxevents is that it must be positive, and in any case
epoll_wait returns no more than set->nevents events. So I'm a bit
wondering if that's the reason. In the first place I'm wondering if
valgrind is aware of that depth..
Yeah, the failure might actually be harmless, but anyway, we should
make the buildfarm green. Also, we should improve the code to avoid
the consistency mentioned above, so I'll apply the patch.
Thanks for the comments!
Best regards,
Etsuro Fujita
On Tue, Apr 6, 2021 at 5:45 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Also, we should improve the code to avoid
the consistency mentioned above,
Sorry, s/consistency/inconsistency/.
I'll apply the patch.
Done. Let's see if this works.
Best regards,
Etsuro Fujita
On Wed, Mar 31, 2021 at 2:12 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Wed, Mar 31, 2021 at 10:11 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
+ * We'll prefer to consider this join async-capable if any table from
+ * either side of the join is considered async-capable.
+ */
+ fpinfo->async_capable = fpinfo_o->async_capable ||
+ fpinfo_i->async_capable;
We need to explain this behavior in the documentation.
It looks somewhat inconsistent to be inhibitive for the default value
of "async_capable", but aggressive in merging?
If the foreign table has async_capable=true, it actually means that
there are resources (CPU, IO, network, etc.) to scan the foreign table
concurrently. And if any table from either side of the join has such
resources, then they could also be used for the join. So I don't
think this behavior is aggressive. I think it would be better to add
more comments, though.
I'll return to this after committing the patch.
I updated the above comment so that it explains the reason. Please
find attached a patch. I did some cleanup as well:
* Simplified code in ExecAppendAsyncEventWait() a little bit to avoid
duplicating the same nevents calculation, and updated comments there.
* Added an assertion to ExecAppendAsyncRequest().
* Updated comments for fetch_more_data_begin().
Best regards,
Etsuro Fujita
Attachments:
cleanup-in-async-support.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index c590f374c6..e201b5404e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -5835,7 +5835,10 @@ merge_fdw_options(PgFdwRelationInfo *fpinfo,
/*
* We'll prefer to consider this join async-capable if any table from
- * either side of the join is considered async-capable.
+ * either side of the join is considered async-capable. This would be
+ * reasonable because in that case the foreign server would have its
+ * own resources to scan that table asynchronously, and the join could
+ * also be computed asynchronously using the resources.
*/
fpinfo->async_capable = fpinfo_o->async_capable ||
fpinfo_i->async_capable;
@@ -6893,6 +6896,9 @@ produce_tuple_asynchronously(AsyncRequest *areq, bool fetch)
/*
* Begin an asynchronous data fetch.
*
+ * Note: this function assumes there is no currently-in-progress asynchronous
+ * data fetch.
+ *
* Note: fetch_more_data must be called to fetch the result.
*/
static void
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index c252757268..3c1f12adaf 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -952,7 +952,10 @@ ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
/* Nothing to do if there are no async subplans needing a new request. */
if (bms_is_empty(node->as_needrequest))
+ {
+ Assert(node->as_nasyncresults == 0);
return false;
+ }
/*
* If there are any asynchronously-generated results that have not yet
@@ -998,17 +1001,16 @@ ExecAppendAsyncRequest(AppendState *node, TupleTableSlot **result)
static void
ExecAppendAsyncEventWait(AppendState *node)
{
+ int nevents = node->as_nasyncplans + 1;
long timeout = node->as_syncdone ? -1 : 0;
WaitEvent occurred_event[EVENT_BUFFER_SIZE];
int noccurred;
- int nevents;
int i;
/* We should never be called when there are no valid async subplans. */
Assert(node->as_nasyncremain > 0);
- node->as_eventset = CreateWaitEventSet(CurrentMemoryContext,
- node->as_nasyncplans + 1);
+ node->as_eventset = CreateWaitEventSet(CurrentMemoryContext, nevents);
AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
NULL, NULL);
@@ -1022,8 +1024,14 @@ ExecAppendAsyncEventWait(AppendState *node)
ExecAsyncConfigureWait(areq);
}
- /* Wait for at least one event to occur. */
- nevents = Min(node->as_nasyncplans + 1, EVENT_BUFFER_SIZE);
+ /* We wait on at most EVENT_BUFFER_SIZE events. */
+ if (nevents > EVENT_BUFFER_SIZE)
+ nevents = EVENT_BUFFER_SIZE;
+
+ /*
+ * If the timeout is -1, wait until at least one event occurs. If the
+ * timeout is 0, poll for events, but do not wait at all.
+ */
noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
nevents, WAIT_EVENT_APPEND_READY);
FreeWaitEventSet(node->as_eventset);
On Thu, Apr 22, 2021 at 12:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
+ * We'll prefer to consider this join async-capable if any table from
+ * either side of the join is considered async-capable.
+ */
+ fpinfo->async_capable = fpinfo_o->async_capable ||
+ fpinfo_i->async_capable;
I updated the above comment so that it explains the reason. Please
find attached a patch. I did some cleanup as well:
I have committed the patch.
Best regards,
Etsuro Fujita
On 4/23/21 8:12 AM, Etsuro Fujita wrote:
I have committed the patch.
While studying the capabilities of AsyncAppend, I noticed an
inconsistency with the cost model of the optimizer:
async_capable = off:
--------------------
Append (cost=100.00..695.00 ...)
-> Foreign Scan on f1 part_1 (cost=100.00..213.31 ...)
-> Foreign Scan on f2 part_2 (cost=100.00..216.07 ...)
-> Foreign Scan on f3 part_3 (cost=100.00..215.62 ...)
async_capable = on:
-------------------
Append (cost=100.00..695.00 ...)
-> Async Foreign Scan on f1 part_1 (cost=100.00..213.31 ...)
-> Async Foreign Scan on f2 part_2 (cost=100.00..216.07 ...)
-> Async Foreign Scan on f3 part_3 (cost=100.00..215.62 ...)
Here I see two problems:
1. The cost of an AsyncAppend is the same as the cost of an Append, but
the execution time of the AsyncAppend over these three remote
partitions is less than half that of the synchronous Append.
2. The cost of an AsyncAppend looks like the sum of the child
ForeignScan costs. I have no concrete case where this is a problem
right now, but I can imagine it becoming one in the future if we have
alternative paths: a complex pushdown in synchronous mode (a few rows
to return) versus a simple asynchronous append with a large set of rows
to return.
--
regards,
Andrey Lepikhov
Postgres Professional
On 4/23/21 8:12 AM, Etsuro Fujita wrote:
I have committed the patch.
Small mistake I found: if no tuple was received from a foreign
partition, EXPLAIN shows the node as never executed. For example,
if we have 0 tuples in f1 and 100 tuples in f2:
Query:
EXPLAIN (ANALYZE, VERBOSE, TIMING OFF, COSTS OFF)
SELECT * FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
LIMIT 101;
Explain:
Limit (actual rows=100 loops=1)
Output: f1.a
-> Append (actual rows=100 loops=1)
-> Async Foreign Scan on public.f1 (never executed)
Output: f1.a
Remote SQL: SELECT a FROM public.l1
-> Async Foreign Scan on public.f2 (actual rows=100 loops=1)
Output: f2.a
Remote SQL: SELECT a FROM public.l2
The patch in the attachment fixes this.
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
never_executed_fix.patch (text/x-patch; charset=UTF-8)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e201b5404e..a960ada441 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6857,8 +6857,13 @@ produce_tuple_asynchronously(AsyncRequest *areq, bool fetch)
}
else
{
- /* There's nothing more to do; just return a NULL pointer */
- result = NULL;
+ /*
+ * There's nothing more to do; just check it and get an empty slot
+ * from the child node.
+ */
+ result = ExecProcNode((PlanState *) node);
+ Assert(TupIsNull(result));
+
/* Mark the request as complete */
ExecAsyncRequestDone(areq, result);
}
On 4/23/21 8:12 AM, Etsuro Fujita wrote:
On Thu, Apr 22, 2021 at 12:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I have committed the patch.
One more question. Append chooses async plans at the stage of Append
plan creation. Later, the planner performs some optimizations, such as
eliminating trivial Subquery nodes. So AsyncAppend is impossible in
some situations, for example:
(SELECT * FROM f1 WHERE a < 10)
UNION ALL
(SELECT * FROM f2 WHERE a < 10);
But works for the query:
SELECT *
FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
WHERE a < 10;
As far as I understand, this is not a hard limit. We can choose async
subplans at the beginning of the execution stage.
For a demo, I prepared a patch (see the attachment).
It solves the problem and passes the regression tests.
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
asyncappend_executor_fix.patch (text/x-patch; charset=UTF-8)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index a960ada441..655e743c6e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1246,6 +1246,7 @@ postgresGetForeignPlan(PlannerInfo *root,
bool has_final_sort = false;
bool has_limit = false;
ListCell *lc;
+ ForeignScan *fsplan;
/*
* Get FDW private data created by postgresGetForeignUpperPaths(), if any.
@@ -1430,7 +1431,7 @@ postgresGetForeignPlan(PlannerInfo *root,
* field of the finished plan node; we can't keep them in private state
* because then they wouldn't be subject to later planner processing.
*/
- return make_foreignscan(tlist,
+ fsplan = make_foreignscan(tlist,
local_exprs,
scan_relid,
params_list,
@@ -1438,6 +1439,13 @@ postgresGetForeignPlan(PlannerInfo *root,
fdw_scan_tlist,
fdw_recheck_quals,
outer_plan);
+
+ /* If appropriate, consider participation in async operations */
+ fsplan->scan.plan.async_capable = (enable_async_append &&
+ best_path->path.pathkeys == NIL &&
+ !fsplan->scan.plan.parallel_safe &&
+ is_async_capable_path((Path *)best_path));
+ return fsplan;
}
/*
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b3726a54f3..4e70f4eb54 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -524,6 +524,9 @@ ExecSupportsBackwardScan(Plan *node)
if (node->parallel_aware)
return false;
+ if (node->async_capable)
+ return false;
+
switch (nodeTag(node))
{
case T_Result:
@@ -536,10 +539,6 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
- /* With async, tuples may be interleaved, so can't back up. */
- if (((Append *) node)->nasyncplans > 0)
- return false;
-
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 3c1f12adaf..363cf9f4a5 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -117,6 +117,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
int firstvalid;
int i,
j;
+ ListCell *l;
+ bool consider_async = false;
/* check for unsupported flags */
Assert(!(eflags & EXEC_FLAG_MARK));
@@ -197,6 +199,23 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendplanstates = (PlanState **) palloc(nplans *
sizeof(PlanState *));
+ /* If appropriate, consider async append */
+ consider_async = (list_length(node->appendplans) > 1);
+
+ if (!consider_async)
+ {
+ foreach(l, node->appendplans)
+ {
+ Plan *subplan = (Plan *) lfirst(l);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (subplan->async_capable)
+ {
+ subplan->async_capable = false;
+ }
+ }
+ }
+
/*
* call ExecInitNode on each of the valid plans to be executed and save
* the results into the appendplanstates array.
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 632cc31a04..f7302ccf28 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -242,7 +242,6 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
- COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c723f6d635..665cdf3add 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -432,7 +432,6 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
- WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 3746668f52..9e3822f7db 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1716,7 +1716,6 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
- READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index a9aff24831..8792eef451 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -81,7 +81,6 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
-static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1093,10 +1092,10 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
}
/*
- * is_async_capable_path
- * Check whether a given Path node is async-capable.
+ * is_async_capable_plan
+ * Check whether a given Plan node is async-capable.
*/
-static bool
+bool
is_async_capable_path(Path *path)
{
switch (nodeTag(path))
@@ -1134,7 +1133,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
- int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1142,7 +1140,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
- bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1206,11 +1203,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
- /* If appropriate, consider async append */
- consider_async = (enable_async_append && pathkeys == NIL &&
- !best_path->path.parallel_safe &&
- list_length(best_path->subpaths) > 1);
-
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1279,12 +1271,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
subplans = lappend(subplans, subplan);
- /* Check to see if subplan can be executed asynchronously */
- if (consider_async && is_async_capable_path(subpath))
- {
- subplan->async_capable = true;
- ++nasyncplans;
- }
}
/*
@@ -1317,7 +1303,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
- plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d671328dfd..b5ac0a1da2 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -250,7 +250,6 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
- int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index bf1adfc52a..8a96a19e5f 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -115,5 +115,6 @@ extern Plan *set_plan_references(PlannerInfo *root, Plan *plan);
extern void record_plan_function_dependency(PlannerInfo *root, Oid funcid);
extern void record_plan_type_dependency(PlannerInfo *root, Oid typid);
extern bool extract_query_dependencies_walker(Node *node, PlannerInfo *root);
+extern bool is_async_capable_path(Path *path);
#endif /* PLANMAIN_H */
On Mon, Apr 26, 2021 at 3:01 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
While studying the capabilities of AsyncAppend, I noticed an
inconsistency with the cost model of the optimizer:
Here I see two problems:
1. The cost of an AsyncAppend is the same as the cost of an Append, but
the execution time of the AsyncAppend over three remote partitions is
less than half that of the synchronous Append.
2. The cost of an AsyncAppend looks like the sum of the child
ForeignScan costs.
Yeah, we don’t adjust the cost for async Append; it’s the same as that
for sync Append. But I don’t see any issue with that as-is, either.
(It’s not that easy to adjust the cost to an appropriate value in the
case of postgres_fdw, because in that case the cost would vary
depending on which connections are used for scanning the foreign
tables [1].)
I have no concrete case where this is a problem right now, but I can
imagine it becoming one in the future if we have alternative paths: a
complex pushdown in synchronous mode (a few rows to return) versus a
simple asynchronous append with a large set of rows to return.
Yeah, I think it’s better if we could consider async append paths and
estimate the costs for them accurately at path-creation time, not
plan-creation time, because that would make it possible to use async
execution in more cases, as you pointed out. But I left that for
future work, because I wanted to make the first cut simple.
Thanks for the review!
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CAPmGK15i-OyCesd369P8zyBErjN_T18zVYu27714bf_L=COXew@mail.gmail.com
On Mon, Apr 26, 2021 at 7:35 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
Small mistake i found. If no tuple was received from a foreign
partition, explain shows that we never executed node. For example,
if we have 0 tuples in f1 and 100 tuples in f2:
Query:
EXPLAIN (ANALYZE, VERBOSE, TIMING OFF, COSTS OFF)
SELECT * FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
LIMIT 101;
Explain:
Limit (actual rows=100 loops=1)
Output: f1.a
-> Append (actual rows=100 loops=1)
-> Async Foreign Scan on public.f1 (never executed)
Output: f1.a
Remote SQL: SELECT a FROM public.l1
-> Async Foreign Scan on public.f2 (actual rows=100 loops=1)
Output: f2.a
Remote SQL: SELECT a FROM public.l2
The patch in the attachment fixes this.
Thanks for the report and patch! Will look into this.
Best regards,
Etsuro Fujita
On Thu, Mar 4, 2021 at 1:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Another thing I'm concerned about in the postgres_fdw part is the case
where all/many postgres_fdw ForeignScans of an Append use the same
connection, because in that case those ForeignScans are executed one
by one, not in parallel, and hence the overhead of async execution
(i.e., doing ExecAppendAsyncEventWait()) would merely cause a
performance degradation. Here is such an example:
postgres=# create server loopback foreign data wrapper postgres_fdw
options (dbname 'postgres');
postgres=# create user mapping for current_user server loopback;
postgres=# create table pt (a int, b int, c text) partition by range (a);
postgres=# create table loct1 (a int, b int, c text);
postgres=# create table loct2 (a int, b int, c text);
postgres=# create table loct3 (a int, b int, c text);
postgres=# create foreign table p1 partition of pt for values from
(10) to (20) server loopback options (table_name 'loct1');
postgres=# create foreign table p2 partition of pt for values from
(20) to (30) server loopback options (table_name 'loct2');
postgres=# create foreign table p3 partition of pt for values from
(30) to (40) server loopback options (table_name 'loct3');
postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM00000')
from generate_series(0, 99999) i;
postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM00000')
from generate_series(0, 99999) i;
postgres=# insert into p3 select 30 + i % 10, i, to_char(i, 'FM00000')
from generate_series(0, 99999) i;
postgres=# analyze pt;
postgres=# set enable_async_append to off;
postgres=# select count(*) from pt;
count
--------
300000
(1 row)
Time: 366.905 ms
postgres=# set enable_async_append to on;
postgres=# select count(*) from pt;
count
--------
300000
(1 row)
Time: 385.431 ms
I think the user should be careful about this. How about adding a
note about it to the “Asynchronous Execution Options” section in
postgres-fdw.sgml, like the attached?
Best regards,
Etsuro Fujita
Attachments:
note-about-async.patch (application/octet-stream)
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 839126c4ef..97ad04dbe3 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -401,6 +401,15 @@ OPTIONS (ADD password_required 'false');
A table-level option overrides a server-level option.
The default is <literal>false</literal>.
</para>
+
+ <para>
+ In the case that foreign tables are associated with the same foreign
+ server, and scanned using the same connection to the remote server,
+ even if this option is set to <literal>true</literal> for them, they
+ would be scanned serially when processed using asynchronous execution.
+ In that case performance would not be improved, and, what is worse,
+ it might be degraded due to the overhead of asynchronous execution.
+ </para>
</listitem>
</varlistentry>
On Tue, Apr 27, 2021 at 9:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Mon, Apr 26, 2021 at 7:35 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
Small mistake I found: if no tuple was received from a foreign
partition, EXPLAIN shows the node as never executed.
The patch in the attachment fixes this.
Will look into this.
The patch fixes the issue, but I don’t think it’s the right way to go,
because it requires an extra ExecProcNode() call, which wouldn’t be
efficient. Also, the patch wouldn’t address another issue I noticed
with EXPLAIN ANALYZE for async-capable nodes: the command doesn’t
measure the time spent in such nodes accurately. For an async-capable
node using postgres_fdw, it only measures the time spent in
ExecProcNode() in ExecAsyncRequest()/ExecAsyncNotify(), missing the
time spent on other things such as creating a cursor in
ExecAsyncRequest(). :-( To address both issues, I’d like to propose
the attached, in which I added instrumentation support to
ExecAsyncRequest()/ExecAsyncConfigureWait()/ExecAsyncNotify(). I
think this would not only address the reported issue more efficiently,
but also allow collecting timing for async-capable nodes more
accurately.
Best regards,
Etsuro Fujita
Attachments:
fix-EXPLAIN-ANALYZE-for-async-capable-nodes.patchapplication/octet-stream; name=fix-EXPLAIN-ANALYZE-for-async-capable-nodes.patchDownload
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 445bb37191..e9092ba359 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -314,7 +314,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
MemoryContextSwitchTo(oldcxt);
}
}
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index f42f07622e..77ca5abcdc 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -974,7 +974,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
MemoryContextSwitchTo(oldcxt);
}
}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 6f533c745d..6a0e27d0f6 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10132,18 +10132,32 @@ SELECT * FROM join_tbl ORDER BY a1;
(3 rows)
DELETE FROM join_tbl;
+DROP TABLE local_tbl;
+DROP FOREIGN TABLE remote_tbl;
+DROP FOREIGN TABLE insert_tbl;
+DROP TABLE base_tbl3;
+DROP TABLE base_tbl4;
RESET enable_mergejoin;
RESET enable_hashjoin;
+-- Check EXPLAIN ANALYZE for a query that scans empty partitions asynchronously
+DELETE FROM async_p1;
+DELETE FROM async_p2;
+DELETE FROM async_p3;
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM async_pt;
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Append (actual rows=0 loops=1)
+ -> Async Foreign Scan on async_p1 async_pt_1 (actual rows=0 loops=1)
+ -> Async Foreign Scan on async_p2 async_pt_2 (actual rows=0 loops=1)
+ -> Seq Scan on async_p3 async_pt_3 (actual rows=0 loops=1)
+(4 rows)
+
-- Clean up
DROP TABLE async_pt;
DROP TABLE base_tbl1;
DROP TABLE base_tbl2;
DROP TABLE result_tbl;
-DROP TABLE local_tbl;
-DROP FOREIGN TABLE remote_tbl;
-DROP FOREIGN TABLE insert_tbl;
-DROP TABLE base_tbl3;
-DROP TABLE base_tbl4;
DROP TABLE join_tbl;
ALTER SERVER loopback OPTIONS (DROP async_capable);
ALTER SERVER loopback2 OPTIONS (DROP async_capable);
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 8bcdc8d616..38ae9149b8 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1542,7 +1542,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_values);
/* Set the async-capable flag */
- fsstate->async_capable = node->ss.ps.plan->async_capable;
+ fsstate->async_capable = node->ss.ps.async_capable;
}
/*
@@ -6864,7 +6864,7 @@ produce_tuple_asynchronously(AsyncRequest *areq, bool fetch)
}
/* Get a tuple from the ForeignScan node */
- result = ExecProcNode((PlanState *) node);
+ result = areq->requestee->ExecProcNodeReal(areq->requestee);
if (!TupIsNull(result))
{
/* Mark the request as complete */
@@ -6953,6 +6953,11 @@ process_pending_request(AsyncRequest *areq)
/* Unlike AsyncNotify, we call ExecAsyncResponse ourselves */
ExecAsyncResponse(areq);
+ /* Also, we do instrumentation ourselves, if required */
+ if (areq->requestee->instrument)
+ InstrUpdateTupleCount(areq->requestee->instrument,
+ TupIsNull(areq->result) ? 0.0 : 1.0);
+
MemoryContextSwitchTo(oldcontext);
}
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 000e2534fc..333988cd93 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3226,19 +3226,28 @@ INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND
SELECT * FROM join_tbl ORDER BY a1;
DELETE FROM join_tbl;
+DROP TABLE local_tbl;
+DROP FOREIGN TABLE remote_tbl;
+DROP FOREIGN TABLE insert_tbl;
+DROP TABLE base_tbl3;
+DROP TABLE base_tbl4;
+
RESET enable_mergejoin;
RESET enable_hashjoin;
+-- Check EXPLAIN ANALYZE for a query that scans empty partitions asynchronously
+DELETE FROM async_p1;
+DELETE FROM async_p2;
+DELETE FROM async_p3;
+
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM async_pt;
+
-- Clean up
DROP TABLE async_pt;
DROP TABLE base_tbl1;
DROP TABLE base_tbl2;
DROP TABLE result_tbl;
-DROP TABLE local_tbl;
-DROP FOREIGN TABLE remote_tbl;
-DROP FOREIGN TABLE insert_tbl;
-DROP TABLE base_tbl3;
-DROP TABLE base_tbl4;
DROP TABLE join_tbl;
ALTER SERVER loopback OPTIONS (DROP async_capable);
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index f1985e658c..75108d36be 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -15,6 +15,7 @@
#include "postgres.h"
#include "executor/execAsync.h"
+#include "executor/executor.h"
#include "executor/nodeAppend.h"
#include "executor/nodeForeignscan.h"
@@ -24,6 +25,13 @@
void
ExecAsyncRequest(AsyncRequest *areq)
{
+ if (areq->requestee->chgParam != NULL) /* something changed? */
+ ExecReScan(areq->requestee); /* let ReScan handle this */
+
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
@@ -36,6 +44,11 @@ ExecAsyncRequest(AsyncRequest *areq)
}
ExecAsyncResponse(areq);
+
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull(areq->result) ? 0.0 : 1.0);
}
/*
@@ -48,6 +61,10 @@ ExecAsyncRequest(AsyncRequest *areq)
void
ExecAsyncConfigureWait(AsyncRequest *areq)
{
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
@@ -58,6 +75,10 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
elog(ERROR, "unrecognized node type: %d",
(int) nodeTag(areq->requestee));
}
+
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0.0);
}
/*
@@ -66,6 +87,10 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
void
ExecAsyncNotify(AsyncRequest *areq)
{
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
@@ -78,6 +103,11 @@ ExecAsyncNotify(AsyncRequest *areq)
}
ExecAsyncResponse(areq);
+
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull(areq->result) ? 0.0 : 1.0);
}
/*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index df3d7f9a8b..58b4968735 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1214,7 +1214,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
+ resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
}
else
{
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9f8c7582e0..753f46863b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -407,7 +407,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument);
+ result->instrument = InstrAlloc(1, estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 237e13361b..b5792c1e53 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -28,7 +28,7 @@ static void WalUsageAdd(WalUsage *dst, WalUsage *add);
/* Allocate new instrumentation structure(s) */
Instrumentation *
-InstrAlloc(int n, int instrument_options)
+InstrAlloc(int n, int instrument_options, bool async_mode)
{
Instrumentation *instr;
@@ -46,6 +46,7 @@ InstrAlloc(int n, int instrument_options)
instr[i].need_bufusage = need_buffers;
instr[i].need_walusage = need_wal;
instr[i].need_timer = need_timer;
+ instr[i].async_mode = async_mode;
}
}
@@ -82,6 +83,7 @@ InstrStartNode(Instrumentation *instr)
void
InstrStopNode(Instrumentation *instr, double nTuples)
{
+ double save_tuplecount = instr->tuplecount;
instr_time endtime;
/* count the returned tuples */
@@ -114,6 +116,26 @@ InstrStopNode(Instrumentation *instr, double nTuples)
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
}
+ else
+ {
+ /*
+ * In async mode, if the plan node hadn't emitted any tuples before,
+ * this might be the first tuple
+ */
+ if (instr->async_mode && save_tuplecount < 1.0)
+ instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
+ }
+}
+
+/* Update tuple count */
+void
+InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+{
+ if (!instr->running)
+ elog(ERROR, "InstrUpdateTupleCount called on node not yet executed");
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
}
/* Finish a run cycle for a plan node */
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 3c1f12adaf..bf6aa10ad6 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -440,7 +440,7 @@ ExecReScanAppend(AppendState *node)
/*
* If chgParam of subnode is not null then plan will be re-scanned by
- * first ExecProcNode.
+ * first ExecProcNode or by first ExecAsyncRequest.
*/
if (subnode->chgParam == NULL)
ExecReScan(subnode);
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 898890fb08..9dc38d47ea 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -209,6 +209,13 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
scanstate->fdw_recheck_quals =
ExecInitQual(node->fdw_recheck_quals, (PlanState *) scanstate);
+ /*
+ * Determine whether to scan the foreign relation asynchronously or not;
+ * this has to be kept in sync with the code in ExecInitAppend().
+ */
+ scanstate->ss.ps.async_capable = (((Plan *) node)->async_capable &&
+ estate->es_epq_active == NULL);
+
/*
* Initialize FDW-related state.
*/
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index aa8eceda5f..c79e46aaaa 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -55,6 +55,7 @@ typedef struct Instrumentation
bool need_timer; /* true if we need timer data */
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
+ bool async_mode; /* true if node is in async mode */
/* Info about current plan cycle: */
bool running; /* true if we've completed first tuple */
instr_time starttime; /* start time of current iteration of node */
@@ -84,10 +85,12 @@ typedef struct WorkerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options);
+extern Instrumentation *InstrAlloc(int n, int instrument_options,
+ bool async_mode);
extern void InstrInit(Instrumentation *instr, int instrument_options);
extern void InstrStartNode(Instrumentation *instr);
extern void InstrStopNode(Instrumentation *instr, double nTuples);
+extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
extern void InstrStartParallelQuery(void);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e7ae21c023..3c22284b09 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1003,6 +1003,8 @@ typedef struct PlanState
ExprContext *ps_ExprContext; /* node's expression-evaluation context */
ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */
+ bool async_capable; /* true if node is async-capable */
+
/*
* Scanslot's descriptor if known. This is a bit of a hack, but otherwise
* it's hard for expression compilation to optimize based on the
On Tue, Apr 27, 2021 at 3:57 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
One more question. Append chooses async plans at the stage of Append
plan creation.
Later, the planner performs some optimizations, such as eliminating
trivial Subquery nodes. So, AsyncAppend is impossible in some
situations, for example:
(SELECT * FROM f1 WHERE a < 10)
UNION ALL
(SELECT * FROM f2 WHERE a < 10);
But it works for the query:
SELECT *
FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
WHERE a < 10;
As far as I understand, this is not a hard limit.
I think so, but IMO this would be an improvement rather than a bug fix.
We can choose async
subplans at the beginning of the execution stage.
For a demo, I prepared a patch (see attachment).
It solves the problem and passes the regression tests.
Thanks for the patch! IIUC, another approach to this would be the
patch you proposed before [1]. Right?
I didn’t have time to look at the patch in [1] for PG14. My apologies
for that. Actually, I was planning to return to it when the development
for PG15 starts.
Sorry for the late reply.
Best regards,
Etsuro Fujita
[1]: /messages/by-id/7fe10f95-ac6c-c81d-a9d3-227493eb9055@postgrespro.ru
Greetings,
* Etsuro Fujita (etsuro.fujita@gmail.com) wrote:
On Thu, Mar 4, 2021 at 1:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Another thing I'm concerned about in the postgres_fdw part is the case
where all/many postgres_fdw ForeignScans of an Append use the same
connection, because in that case those ForeignScans are executed one
by one, not in parallel, and hence the overhead of async execution
(i.e., doing ExecAppendAsyncEventWait()) would merely cause a
performance degradation. Here is such an example:
postgres=# create server loopback foreign data wrapper postgres_fdw
options (dbname 'postgres');
postgres=# create user mapping for current_user server loopback;
postgres=# create table pt (a int, b int, c text) partition by range (a);
postgres=# create table loct1 (a int, b int, c text);
postgres=# create table loct2 (a int, b int, c text);
postgres=# create table loct3 (a int, b int, c text);
postgres=# create foreign table p1 partition of pt for values from
(10) to (20) server loopback options (table_name 'loct1');
postgres=# create foreign table p2 partition of pt for values from
(20) to (30) server loopback options (table_name 'loct2');
postgres=# create foreign table p3 partition of pt for values from
(30) to (40) server loopback options (table_name 'loct3');
postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM00000')
from generate_series(0, 99999) i;
postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM00000')
from generate_series(0, 99999) i;
postgres=# insert into p3 select 30 + i % 10, i, to_char(i, 'FM00000')
from generate_series(0, 99999) i;
postgres=# analyze pt;
postgres=# set enable_async_append to off;
postgres=# select count(*) from pt;
count
--------
300000
(1 row)
Time: 366.905 ms
postgres=# set enable_async_append to on;
postgres=# select count(*) from pt;
count
--------
300000
(1 row)
Time: 385.431 ms
I think the user should be careful about this. How about adding a
note about it to the “Asynchronous Execution Options” section in
postgres-fdw.sgml, like the attached?
I'd suggest the language point out that it's not actually possible to do
otherwise, since they all need to be part of the same transaction.
Without that, it looks like we're just missing a trick somewhere and
someone might think that they could improve PG to open multiple
connections to the same remote server to execute them in parallel.
Maybe:
In order to ensure that the data being returned from a foreign server
is consistent, postgres_fdw will only open one connection for a given
foreign server and will run all queries against that server sequentially
even if there are multiple foreign tables involved. In such a case, it
may be more performant to disable this option to eliminate the overhead
associated with running queries asynchronously.
... then again, it'd really be better if we could figure out a way to
just do the right thing here. I haven't looked at this in depth but I
would think that the overhead of async would be well worth it just about
any time there's more than one foreign server involved. Is it not
reasonable to have a heuristic where we disable async in the cases where
there's only one foreign server, but have it enabled all the other time?
While continuing to allow users to manage it explicitly if they want.
Thanks,
Stephen
On 6/5/21 22:12, Stephen Frost wrote:
* Etsuro Fujita (etsuro.fujita@gmail.com) wrote:
I think the user should be careful about this. How about adding a
note about it to the “Asynchronous Execution Options” section in
postgres-fdw.sgml, like the attached?
+1
... then again, it'd really be better if we could figure out a way to
just do the right thing here. I haven't looked at this in depth but I
would think that the overhead of async would be well worth it just about
any time there's more than one foreign server involved. Is it not
reasonable to have a heuristic where we disable async in the cases where
there's only one foreign server, but have it enabled all the other time?
While continuing to allow users to manage it explicitly if they want.
Benchmarking SELECT from foreign partitions hosted on the same server,
I see these results:
With async append:
1 partition - 178 ms; 4 - 263; 8 - 450; 16 - 860; 32 - 1740.
Without:
1 - 178 ms; 4 - 583; 8 - 1140; 16 - 2302; 32 - 4620.
So, these results show that we have a reason to use async append in the
case where there's only one foreign server.
--
regards,
Andrey Lepikhov
Postgres Professional
On 6/5/21 11:45, Etsuro Fujita wrote:
On Tue, Apr 27, 2021 at 9:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
The patch fixes the issue, but I don’t think it’s the right way to go,
because it requires an extra ExecProcNode() call, which wouldn’t be
efficient. Also, the patch wouldn’t address another issue I noticed
in EXPLAIN ANALYZE for async-capable nodes: the command doesn’t
measure the time spent in such nodes accurately. For an
async-capable node using postgres_fdw, it only measures the time spent
in ExecProcNode() in ExecAsyncRequest()/ExecAsyncNotify(), missing the
time spent on other things such as creating a cursor in
ExecAsyncRequest(). :-( To address both issues, I’d like to propose
the attached, in which I added instrumentation support to
ExecAsyncRequest()/ExecAsyncConfigureWait()/ExecAsyncNotify(). I
think this would not only address the reported issue more efficiently,
but also allow collecting timing for async-capable nodes more accurately.
Ok, I agree with the approach, but the following test case failed:
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM (
(SELECT * FROM f1) UNION ALL (SELECT * FROM f2)
) q1 LIMIT 100;
ERROR: InstrUpdateTupleCount called on node not yet executed
The initialization script is in the attachment.
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
On 6/5/21 14:11, Etsuro Fujita wrote:
On Tue, Apr 27, 2021 at 3:57 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
One more question. Append chooses async plans at the stage of Append
plan creation.
Later, the planner performs some optimizations, such as eliminating
trivial Subquery nodes. So, AsyncAppend is impossible in some
situations, for example:
(SELECT * FROM f1 WHERE a < 10)
UNION ALL
(SELECT * FROM f2 WHERE a < 10);
But it works for the query:
SELECT *
FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
WHERE a < 10;
As far as I understand, this is not a hard limit.
I think so, but IMO this would be an improvement rather than a bug fix.
We can choose async
subplans at the beginning of the execution stage.
For a demo, I prepared a patch (see attachment).
It solves the problem and passes the regression tests.
Thanks for the patch! IIUC, another approach to this would be the
patch you proposed before [1]. Right?
Yes. I think the new solution will be better.
--
regards,
Andrey Lepikhov
Postgres Professional
On Fri, May 7, 2021 at 2:12 AM Stephen Frost <sfrost@snowman.net> wrote:
I'd suggest the language point out that it's not actually possible to do
otherwise, since they all need to be part of the same transaction.
Without that, it looks like we're just missing a trick somewhere and
someone might think that they could improve PG to open multiple
connections to the same remote server to execute them in parallel.
Agreed.
Maybe:
In order to ensure that the data being returned from a foreign server
is consistent, postgres_fdw will only open one connection for a given
foreign server and will run all queries against that server sequentially
even if there are multiple foreign tables involved. In such a case, it
may be more performant to disable this option to eliminate the overhead
associated with running queries asynchronously.
Ok, I’ll merge this into the next version.
Thanks!
Best regards,
Etsuro Fujita
On Fri, May 7, 2021 at 7:35 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
On 6/5/21 14:11, Etsuro Fujita wrote:
On Tue, Apr 27, 2021 at 3:57 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
One more question. Append chooses async plans at the stage of Append
plan creation.
Later, the planner performs some optimizations, such as eliminating
trivial Subquery nodes. So, AsyncAppend is impossible in some
situations, for example:
(SELECT * FROM f1 WHERE a < 10)
UNION ALL
(SELECT * FROM f2 WHERE a < 10);
We can choose async
subplans at the beginning of the execution stage.
For a demo, I prepared a patch (see attachment).
It solves the problem and passes the regression tests.
IIUC, another approach to this would be the
patch you proposed before [1]. Right?
Yes. I think the new solution will be better.
Ok, will review.
I think it would be better to start a new thread for this, and add the
patch to the next CF so that it doesn’t get lost.
Best regards,
Etsuro Fujita
On Fri, May 7, 2021 at 7:32 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
Ok, I agree with the approach, but the following test case failed:
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM (
(SELECT * FROM f1) UNION ALL (SELECT * FROM f2)
) q1 LIMIT 100;
ERROR: InstrUpdateTupleCount called on node not yet executed
The initialization script is in the attachment.
Reproduced. Here is the EXPLAIN output for the query:
explain verbose select * from ((select * from f1) union all (select *
from f2)) q1 limit 100;
QUERY PLAN
--------------------------------------------------------------------------------------
Limit (cost=100.00..104.70 rows=100 width=4)
Output: f1.a
-> Append (cost=100.00..724.22 rows=13292 width=4)
-> Async Foreign Scan on public.f1 (cost=100.00..325.62
rows=6554 width=4)
Output: f1.a
Remote SQL: SELECT a FROM public.l1
-> Async Foreign Scan on public.f2 (cost=100.00..332.14
rows=6738 width=4)
Output: f2.a
Remote SQL: SELECT a FROM public.l2
(9 rows)
When executing the query “select * from ((select * from f1) union all
(select * from f2)) q1 limit 100” in async mode, the remote queries
for f1 and f2 would be sent to the remote at the same time in the
first ExecAppend(). If the result for the remote query for f1 is
returned first, the local query would be processed using the result,
and the remote query for f2 in progress would be processed during
ExecutorEnd() using process_pending_request() (and vice versa). But
in the EXPLAIN ANALYZE case, InstrEndLoop() is called *before*
ExecutorEnd(), and it initializes the instr->running flag, so in that
case, when processing the in-progress remote query in
process_pending_request(), we would call InstrUpdateTupleCount() with
the flag unset, causing this error.
I think a simple fix for this would be to just remove the check of
whether the instr->running flag is set in InstrUpdateTupleCount().
Attached is an updated patch, in which I also updated a comment in
execnodes.h and docs in fdwhandler.sgml to match the code in
nodeAppend.c, and fixed typos in comments in nodeAppend.c.
Thanks for the review and script!
Best regards,
Etsuro Fujita
Attachments:
fix-EXPLAIN-ANALYZE-for-async-capable-nodes-v2.patchapplication/octet-stream; name=fix-EXPLAIN-ANALYZE-for-async-capable-nodes-v2.patchDownload
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 445bb37191..e9092ba359 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -314,7 +314,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
MemoryContextSwitchTo(oldcxt);
}
}
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index f42f07622e..77ca5abcdc 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -974,7 +974,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
MemoryContextSwitchTo(oldcxt);
}
}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 6f533c745d..0b0c45f0d9 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10051,6 +10051,21 @@ SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
Filter: (t1_3.b === 505)
(14 rows)
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Limit (actual rows=1 loops=1)
+ -> Append (actual rows=1 loops=1)
+ -> Async Foreign Scan on async_p1 t1_1 (actual rows=0 loops=1)
+ Filter: (b === 505)
+ -> Async Foreign Scan on async_p2 t1_2 (actual rows=0 loops=1)
+ Filter: (b === 505)
+ -> Seq Scan on async_p3 t1_3 (actual rows=1 loops=1)
+ Filter: (b === 505)
+ Rows Removed by Filter: 101
+(9 rows)
+
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
a | b | c
------+-----+------
@@ -10132,18 +10147,32 @@ SELECT * FROM join_tbl ORDER BY a1;
(3 rows)
DELETE FROM join_tbl;
+DROP TABLE local_tbl;
+DROP FOREIGN TABLE remote_tbl;
+DROP FOREIGN TABLE insert_tbl;
+DROP TABLE base_tbl3;
+DROP TABLE base_tbl4;
RESET enable_mergejoin;
RESET enable_hashjoin;
+-- Check EXPLAIN ANALYZE for a query that scans empty partitions asynchronously
+DELETE FROM async_p1;
+DELETE FROM async_p2;
+DELETE FROM async_p3;
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM async_pt;
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Append (actual rows=0 loops=1)
+ -> Async Foreign Scan on async_p1 async_pt_1 (actual rows=0 loops=1)
+ -> Async Foreign Scan on async_p2 async_pt_2 (actual rows=0 loops=1)
+ -> Seq Scan on async_p3 async_pt_3 (actual rows=0 loops=1)
+(4 rows)
+
-- Clean up
DROP TABLE async_pt;
DROP TABLE base_tbl1;
DROP TABLE base_tbl2;
DROP TABLE result_tbl;
-DROP TABLE local_tbl;
-DROP FOREIGN TABLE remote_tbl;
-DROP FOREIGN TABLE insert_tbl;
-DROP TABLE base_tbl3;
-DROP TABLE base_tbl4;
DROP TABLE join_tbl;
ALTER SERVER loopback OPTIONS (DROP async_capable);
ALTER SERVER loopback2 OPTIONS (DROP async_capable);
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 4ff58d9c27..ee93ee07cc 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1542,7 +1542,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_values);
/* Set the async-capable flag */
- fsstate->async_capable = node->ss.ps.plan->async_capable;
+ fsstate->async_capable = node->ss.ps.async_capable;
}
/*
@@ -6867,7 +6867,7 @@ produce_tuple_asynchronously(AsyncRequest *areq, bool fetch)
}
/* Get a tuple from the ForeignScan node */
- result = ExecProcNode((PlanState *) node);
+ result = areq->requestee->ExecProcNodeReal(areq->requestee);
if (!TupIsNull(result))
{
/* Mark the request as complete */
@@ -6956,6 +6956,11 @@ process_pending_request(AsyncRequest *areq)
/* Unlike AsyncNotify, we call ExecAsyncResponse ourselves */
ExecAsyncResponse(areq);
+ /* Also, we do instrumentation ourselves, if required */
+ if (areq->requestee->instrument)
+ InstrUpdateTupleCount(areq->requestee->instrument,
+ TupIsNull(areq->result) ? 0.0 : 1.0);
+
MemoryContextSwitchTo(oldcontext);
}
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 000e2534fc..53adfe2abc 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3195,6 +3195,8 @@ SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
-- Check with foreign modify
@@ -3226,19 +3228,28 @@ INSERT INTO join_tbl SELECT * FROM async_pt LEFT JOIN t ON (async_pt.a = t.a AND
SELECT * FROM join_tbl ORDER BY a1;
DELETE FROM join_tbl;
+DROP TABLE local_tbl;
+DROP FOREIGN TABLE remote_tbl;
+DROP FOREIGN TABLE insert_tbl;
+DROP TABLE base_tbl3;
+DROP TABLE base_tbl4;
+
RESET enable_mergejoin;
RESET enable_hashjoin;
+-- Check EXPLAIN ANALYZE for a query that scans empty partitions asynchronously
+DELETE FROM async_p1;
+DELETE FROM async_p2;
+DELETE FROM async_p3;
+
+EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
+SELECT * FROM async_pt;
+
-- Clean up
DROP TABLE async_pt;
DROP TABLE base_tbl1;
DROP TABLE base_tbl2;
DROP TABLE result_tbl;
-DROP TABLE local_tbl;
-DROP FOREIGN TABLE remote_tbl;
-DROP FOREIGN TABLE insert_tbl;
-DROP TABLE base_tbl3;
-DROP TABLE base_tbl4;
DROP TABLE join_tbl;
ALTER SERVER loopback OPTIONS (DROP async_capable);
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 8aa7edfe4a..d1194def82 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1597,7 +1597,7 @@ ForeignAsyncRequest(AsyncRequest *areq);
<literal>areq->callback_pending</literal> to <literal>true</literal>
for the <structname>ForeignScan</structname> node to get a callback from
the callback functions described below. If no more tuples are available,
- set the slot to NULL, and the
+ set the slot to NULL or an empty slot, and the
<literal>areq->request_complete</literal> flag to
<literal>true</literal>. It's recommended to use
<function>ExecAsyncRequestDone</function> or
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index f1985e658c..75108d36be 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -15,6 +15,7 @@
#include "postgres.h"
#include "executor/execAsync.h"
+#include "executor/executor.h"
#include "executor/nodeAppend.h"
#include "executor/nodeForeignscan.h"
@@ -24,6 +25,13 @@
void
ExecAsyncRequest(AsyncRequest *areq)
{
+ if (areq->requestee->chgParam != NULL) /* something changed? */
+ ExecReScan(areq->requestee); /* let ReScan handle this */
+
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
@@ -36,6 +44,11 @@ ExecAsyncRequest(AsyncRequest *areq)
}
ExecAsyncResponse(areq);
+
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull(areq->result) ? 0.0 : 1.0);
}
/*
@@ -48,6 +61,10 @@ ExecAsyncRequest(AsyncRequest *areq)
void
ExecAsyncConfigureWait(AsyncRequest *areq)
{
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
@@ -58,6 +75,10 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
elog(ERROR, "unrecognized node type: %d",
(int) nodeTag(areq->requestee));
}
+
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0.0);
}
/*
@@ -66,6 +87,10 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
void
ExecAsyncNotify(AsyncRequest *areq)
{
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
@@ -78,6 +103,11 @@ ExecAsyncNotify(AsyncRequest *areq)
}
ExecAsyncResponse(areq);
+
+ /* must provide our own instrumentation support */
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull(areq->result) ? 0.0 : 1.0);
}
/*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index df3d7f9a8b..58b4968735 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1214,7 +1214,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
+ resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
}
else
{
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9f8c7582e0..753f46863b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -407,7 +407,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument);
+ result->instrument = InstrAlloc(1, estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 237e13361b..2b106d8473 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -28,7 +28,7 @@ static void WalUsageAdd(WalUsage *dst, WalUsage *add);
/* Allocate new instrumentation structure(s) */
Instrumentation *
-InstrAlloc(int n, int instrument_options)
+InstrAlloc(int n, int instrument_options, bool async_mode)
{
Instrumentation *instr;
@@ -46,6 +46,7 @@ InstrAlloc(int n, int instrument_options)
instr[i].need_bufusage = need_buffers;
instr[i].need_walusage = need_wal;
instr[i].need_timer = need_timer;
+ instr[i].async_mode = async_mode;
}
}
@@ -82,6 +83,7 @@ InstrStartNode(Instrumentation *instr)
void
InstrStopNode(Instrumentation *instr, double nTuples)
{
+ double save_tuplecount = instr->tuplecount;
instr_time endtime;
/* count the returned tuples */
@@ -114,6 +116,23 @@ InstrStopNode(Instrumentation *instr, double nTuples)
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
}
+ else
+ {
+ /*
+ * In async mode, if the plan node hadn't emitted any tuples before,
+ * this might be the first tuple
+ */
+ if (instr->async_mode && save_tuplecount < 1.0)
+ instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
+ }
+}
+
+/* Update tuple count */
+void
+InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+{
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
}
/* Finish a run cycle for a plan node */
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 3c1f12adaf..1558fafad1 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -362,9 +362,9 @@ ExecAppend(PlanState *pstate)
}
/*
- * wait or poll async events if any. We do this before checking for
- * the end of iteration, because it might drain the remaining async
- * subplans.
+ * wait or poll for async events if any. We do this before checking
+ * for the end of iteration, because it might drain the remaining
+ * async subplans.
*/
if (node->as_nasyncremain > 0)
ExecAppendAsyncEventWait(node);
@@ -440,7 +440,7 @@ ExecReScanAppend(AppendState *node)
/*
* If chgParam of subnode is not null then plan will be re-scanned by
- * first ExecProcNode.
+ * first ExecProcNode or by first ExecAsyncRequest.
*/
if (subnode->chgParam == NULL)
ExecReScan(subnode);
@@ -911,7 +911,7 @@ ExecAppendAsyncGetNext(AppendState *node, TupleTableSlot **result)
{
CHECK_FOR_INTERRUPTS();
- /* Wait or poll async events. */
+ /* Wait or poll for async events. */
ExecAppendAsyncEventWait(node);
/* Request a tuple asynchronously. */
@@ -1084,7 +1084,7 @@ ExecAsyncAppendResponse(AsyncRequest *areq)
/* Nothing to do if the request is pending. */
if (!areq->request_complete)
{
- /* The request would have been pending for a callback */
+ /* The request would have been pending for a callback. */
Assert(areq->callback_pending);
return;
}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 898890fb08..9dc38d47ea 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -209,6 +209,13 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
scanstate->fdw_recheck_quals =
ExecInitQual(node->fdw_recheck_quals, (PlanState *) scanstate);
+ /*
+ * Determine whether to scan the foreign relation asynchronously or not;
+ * this has to be kept in sync with the code in ExecInitAppend().
+ */
+ scanstate->ss.ps.async_capable = (((Plan *) node)->async_capable &&
+ estate->es_epq_active == NULL);
+
/*
* Initialize FDW-related state.
*/
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index aa8eceda5f..c79e46aaaa 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -55,6 +55,7 @@ typedef struct Instrumentation
bool need_timer; /* true if we need timer data */
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
+ bool async_mode; /* true if node is in async mode */
/* Info about current plan cycle: */
bool running; /* true if we've completed first tuple */
instr_time starttime; /* start time of current iteration of node */
@@ -84,10 +85,12 @@ typedef struct WorkerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options);
+extern Instrumentation *InstrAlloc(int n, int instrument_options,
+ bool async_mode);
extern void InstrInit(Instrumentation *instr, int instrument_options);
extern void InstrStartNode(Instrumentation *instr);
extern void InstrStopNode(Instrumentation *instr, double nTuples);
+extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
extern void InstrStartParallelQuery(void);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e7ae21c023..91a1c3a780 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -538,7 +538,8 @@ typedef struct AsyncRequest
int request_index; /* Scratch space for requestor */
bool callback_pending; /* Callback is needed */
bool request_complete; /* Request complete, result valid */
- TupleTableSlot *result; /* Result (NULL if no more tuples) */
+ TupleTableSlot *result; /* Result (NULL or an empty slot if no more
+ * tuples) */
} AsyncRequest;
/* ----------------
@@ -1003,6 +1004,8 @@ typedef struct PlanState
ExprContext *ps_ExprContext; /* node's expression-evaluation context */
ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */
+ bool async_capable; /* true if node is async-capable */
+
/*
* Scanslot's descriptor if known. This is a bit of a hack, but otherwise
* it's hard for expression compilation to optimize based on the
On 10/5/21 08:03, Etsuro Fujita wrote:
On Fri, May 7, 2021 at 7:32 PM Andrey Lepikhov
I think a simple fix for this would be to just remove the check on
whether the instr->running flag is set in InstrUpdateTupleCount().
Attached is an updated patch, in which I also updated a comment in
execnodes.h and docs in fdwhandler.sgml to match the code in
nodeAppend.c, and fixed typos in comments in nodeAppend.c.
Your patch fixes the problem. But I found two more problems:
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM (
(SELECT * FROM f1)
UNION ALL
(SELECT * FROM f2)
UNION ALL
(SELECT * FROM l3)
) q1
LIMIT 6709;
QUERY PLAN
--------------------------------------------------------------
Limit (actual rows=6709 loops=1)
-> Append (actual rows=6709 loops=1)
-> Async Foreign Scan on f1 (actual rows=1 loops=1)
-> Async Foreign Scan on f2 (actual rows=1 loops=1)
-> Seq Scan on l3 (actual rows=6708 loops=1)
Here we scan 6710 tuples at the lower level but append only 6709. Where
did we lose one tuple?
2.
SELECT * FROM (
(SELECT * FROM f1)
UNION ALL
(SELECT * FROM f2)
UNION ALL
(SELECT * FROM f3 WHERE a > 0)
) q1 LIMIT 3000;
QUERY PLAN
--------------------------------------------------------------
Limit (actual rows=3000 loops=1)
-> Append (actual rows=3000 loops=1)
-> Async Foreign Scan on f1 (actual rows=0 loops=1)
-> Async Foreign Scan on f2 (actual rows=0 loops=1)
-> Foreign Scan on f3 (actual rows=3000 loops=1)
Here we give preference to the synchronous scan. Why?
--
regards,
Andrey Lepikhov
Postgres Professional
On 7/5/21 21:05, Etsuro Fujita wrote:
I think it would be better to start a new thread for this, and add the
patch to the next CF so that it doesn’t get lost.
The current implementation of async append chooses asynchronous
subplans at the append-plan creation phase. This is a safe approach,
but we lose some optimizations, such as flattening of trivial
subqueries, and can't execute some simple queries asynchronously. For
example:
EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF, COSTS OFF)
(SELECT * FROM f1 WHERE a < 10) UNION ALL
(SELECT * FROM f2 WHERE a < 10);
But, as far as I understand, we can choose these subplans later, at
the append initialization phase, after all optimizations have already
been applied.
The attachment contains an implementation of the proposed approach.
The initial script for the example can be found in the parent thread [1].
[1]: /messages/by-id/a38bb206-8340-9528-5ef6-37de2d5cb1a3@postgrespro.ru
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
0001-Defer-selection-of-asynchronous-subplans-to-the-exec.patch (text/plain)
From 395b1d62389cf40520a4afd87c11301aa2b17df2 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru>
Date: Tue, 11 May 2021 08:43:03 +0500
Subject: [PATCH] Defer selection of asynchronous subplans to the executor
initial phase.
---
contrib/postgres_fdw/postgres_fdw.c | 10 +++++++++-
src/backend/executor/execAmi.c | 7 +++----
src/backend/executor/nodeAppend.c | 19 +++++++++++++++++++
src/backend/nodes/copyfuncs.c | 1 -
src/backend/nodes/outfuncs.c | 1 -
src/backend/nodes/readfuncs.c | 1 -
src/backend/optimizer/plan/createplan.c | 17 +----------------
src/include/nodes/plannodes.h | 1 -
src/include/optimizer/planmain.h | 1 +
9 files changed, 33 insertions(+), 25 deletions(-)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 4ff58d9c27..3e151a6790 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1245,6 +1245,7 @@ postgresGetForeignPlan(PlannerInfo *root,
bool has_final_sort = false;
bool has_limit = false;
ListCell *lc;
+ ForeignScan *fsplan;
/*
* Get FDW private data created by postgresGetForeignUpperPaths(), if any.
@@ -1429,7 +1430,7 @@ postgresGetForeignPlan(PlannerInfo *root,
* field of the finished plan node; we can't keep them in private state
* because then they wouldn't be subject to later planner processing.
*/
- return make_foreignscan(tlist,
+ fsplan = make_foreignscan(tlist,
local_exprs,
scan_relid,
params_list,
@@ -1437,6 +1438,13 @@ postgresGetForeignPlan(PlannerInfo *root,
fdw_scan_tlist,
fdw_recheck_quals,
outer_plan);
+
+ /* If appropriate, consider participation in async operations */
+ fsplan->scan.plan.async_capable = (enable_async_append &&
+ best_path->path.pathkeys == NIL &&
+ !fsplan->scan.plan.parallel_safe &&
+ is_async_capable_path((Path *)best_path));
+ return fsplan;
}
/*
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b3726a54f3..4e70f4eb54 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -524,6 +524,9 @@ ExecSupportsBackwardScan(Plan *node)
if (node->parallel_aware)
return false;
+ if (node->async_capable)
+ return false;
+
switch (nodeTag(node))
{
case T_Result:
@@ -536,10 +539,6 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
- /* With async, tuples may be interleaved, so can't back up. */
- if (((Append *) node)->nasyncplans > 0)
- return false;
-
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 3c1f12adaf..363cf9f4a5 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -117,6 +117,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
int firstvalid;
int i,
j;
+ ListCell *l;
+ bool consider_async = false;
/* check for unsupported flags */
Assert(!(eflags & EXEC_FLAG_MARK));
@@ -197,6 +199,23 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendplanstates = (PlanState **) palloc(nplans *
sizeof(PlanState *));
+ /* If appropriate, consider async append */
+ consider_async = (list_length(node->appendplans) > 1);
+
+ if (!consider_async)
+ {
+ foreach(l, node->appendplans)
+ {
+ Plan *subplan = (Plan *) lfirst(l);
+
+ /* Check to see if subplan can be executed asynchronously */
+ if (subplan->async_capable)
+ {
+ subplan->async_capable = false;
+ }
+ }
+ }
+
/*
* call ExecInitNode on each of the valid plans to be executed and save
* the results into the appendplanstates array.
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 90770a89b0..a44185d7fc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -243,7 +243,6 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
- COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8da8b14f0e..cd5dbce76a 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -433,7 +433,6 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
- WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 3772ea07df..3f5879951b 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1717,7 +1717,6 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
- READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7003238d76..b1f2a493f0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -82,7 +82,6 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
-static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1097,7 +1096,7 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
* is_async_capable_path
* Check whether a given Path node is async-capable.
*/
-static bool
+bool
is_async_capable_path(Path *path)
{
switch (nodeTag(path))
@@ -1135,7 +1134,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
- int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1143,7 +1141,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
- bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1207,11 +1204,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
- /* If appropriate, consider async append */
- consider_async = (enable_async_append && pathkeys == NIL &&
- !best_path->path.parallel_safe &&
- list_length(best_path->subpaths) > 1);
-
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1280,12 +1272,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
subplans = lappend(subplans, subplan);
- /* Check to see if subplan can be executed asynchronously */
- if (consider_async && is_async_capable_path(subpath))
- {
- subplan->async_capable = true;
- ++nasyncplans;
- }
}
/*
@@ -1318,7 +1304,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
- plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 841401be20..00f4f5f8ed 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -251,7 +251,6 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
- int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index bf1adfc52a..8a96a19e5f 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -115,5 +115,6 @@ extern Plan *set_plan_references(PlannerInfo *root, Plan *plan);
extern void record_plan_function_dependency(PlannerInfo *root, Oid funcid);
extern void record_plan_type_dependency(PlannerInfo *root, Oid typid);
extern bool extract_query_dependencies_walker(Node *node, PlannerInfo *root);
+extern bool is_async_capable_path(Path *path);
#endif /* PLANMAIN_H */
--
2.31.1
On Mon, May 10, 2021 at 8:45 PM Andrey Lepikhov <a.lepikhov@postgrespro.ru>
wrote:
On 7/5/21 21:05, Etsuro Fujita wrote:
I think it would be better to start a new thread for this, and add the
patch to the next CF so that it doesn’t get lost.

Current implementation of async append choose asynchronous subplans at
the phase of an append plan creation. This is safe approach, but we
loose some optimizations, such of flattening trivial subqueries and
can't execute some simple queries asynchronously. For example:

EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF, COSTS OFF)
(SELECT * FROM f1 WHERE a < 10) UNION ALL
(SELECT * FROM f2 WHERE a < 10);

But, as I could understand, we can choose these subplans later, at the
init append phase when all optimizations already passed.
In attachment - implementation of the proposed approach.

Initial script for the example see in the parent thread [1].
[1]: /messages/by-id/a38bb206-8340-9528-5ef6-37de2d5cb1a3@postgrespro.ru
--
regards,
Andrey Lepikhov
Postgres Professional
Hi,
+ /* Check to see if subplan can be executed asynchronously */
+ if (subplan->async_capable)
+ {
+ subplan->async_capable = false;
It seems the if statement is not needed: you can directly assign false
to subplan->async_capable.
Cheers
On 11/5/21 08:55, Zhihong Yu wrote:
+ /* Check to see if subplan can be executed asynchronously */
+ if (subplan->async_capable)
+ {
+ subplan->async_capable = false;

It seems the if statement is not needed: you can directly assign false
to subplan->async_capable.

Thank you, I agree with you.
A close look at the postgres_fdw regression tests shows at least one
open problem with this approach: we need to handle situations where
all but one partition is pruned and no Append node exists at all.
--
regards,
Andrey Lepikhov
Postgres Professional
On Tue, May 11, 2021 at 11:58 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
Your patch fixes the problem. But I found two more problems:
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM (
(SELECT * FROM f1)
UNION ALL
(SELECT * FROM f2)
UNION ALL
(SELECT * FROM l3)
) q1 LIMIT 6709;
QUERY PLAN
--------------------------------------------------------------
Limit (actual rows=6709 loops=1)
-> Append (actual rows=6709 loops=1)
-> Async Foreign Scan on f1 (actual rows=1 loops=1)
-> Async Foreign Scan on f2 (actual rows=1 loops=1)
-> Seq Scan on l3 (actual rows=6708 loops=1)

Here we scan 6710 tuples at low level but appended only 6709. Where did
we lose one tuple?
The extra tuple, which is from f1 or f2, would have been kept in the
Append node's as_asyncresults and not returned from the Append node to
the Limit node. The async Foreign Scan nodes fetch tuples before the
Append node asks for them, so the fetched tuples may or may not be
used.
2.
SELECT * FROM (
(SELECT * FROM f1)
UNION ALL
(SELECT * FROM f2)
UNION ALL
(SELECT * FROM f3 WHERE a > 0)
) q1 LIMIT 3000;
QUERY PLAN
--------------------------------------------------------------
Limit (actual rows=3000 loops=1)
-> Append (actual rows=3000 loops=1)
-> Async Foreign Scan on f1 (actual rows=0 loops=1)
-> Async Foreign Scan on f2 (actual rows=0 loops=1)
-> Foreign Scan on f3 (actual rows=3000 loops=1)

Here we give preference to the synchronous scan. Why?
This is expected behavior, and the reason is to avoid performance
degradation: you might think it would be better to execute the async
Foreign Scan nodes more aggressively, but that would require
waiting/polling for file descriptor events many times, which is
expensive and might cause a performance regression. I think there is
room for improvement, though.
Thanks!
Best regards,
Etsuro Fujita
On 11/5/21 12:24, Etsuro Fujita wrote:
On Tue, May 11, 2021 at 11:58 AM Andrey Lepikhov
The extra tuple, which is from f1 or f2, would have been kept in the
Append node's as_asyncresults, not returned from the Append node to
the Limit node. The async Foreign Scan nodes would fetch tuples
before the Append node ask the tuples, so the fetched tuples may or
may not be used.
Ok.

-> Append (actual rows=3000 loops=1)
-> Async Foreign Scan on f1 (actual rows=0 loops=1)
-> Async Foreign Scan on f2 (actual rows=0 loops=1)
-> Foreign Scan on f3 (actual rows=3000 loops=1)

Here we give preference to the synchronous scan. Why?
This would be expected behavior, and the reason is avoid performance
degradation; you might think it would be better to execute the async
Foreign Scan nodes more aggressively, but it would require
waiting/polling for file descriptor events many times, which is
expensive and might cause performance degradation. I think there is
room for improvement, though.
Yes, I agree with you. Maybe you can add a note to the documentation on
async_capable, for example:
"... Synchronous and asynchronous scanning strategies can be mixed by
the optimizer in one scan plan of a partitioned table or a 'UNION ALL'
command. For performance reasons, synchronous scans execute before the
first async scan. ..."
--
regards,
Andrey Lepikhov
Postgres Professional
On Tue, May 11, 2021 at 6:27 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
On 11/5/21 12:24, Etsuro Fujita wrote:
-> Append (actual rows=3000 loops=1)
-> Async Foreign Scan on f1 (actual rows=0 loops=1)
-> Async Foreign Scan on f2 (actual rows=0 loops=1)
-> Foreign Scan on f3 (actual rows=3000 loops=1)

Here we give preference to the synchronous scan. Why?
This would be expected behavior, and the reason is avoid performance
degradation; you might think it would be better to execute the async
Foreign Scan nodes more aggressively, but it would require
waiting/polling for file descriptor events many times, which is
expensive and might cause performance degradation. I think there is
room for improvement, though.

Yes, I agree with you. Maybe you can add note in documentation on
async_capable, for example:
"... Synchronous and asynchronous scanning strategies can be mixed by
optimizer in one scan plan of a partitioned table or an 'UNION ALL'
command. For performance reasons, synchronous scans executes before the
first of async scan. ..."

+1 But I think this is an independent issue, so I think it would be
better to address the issue separately.
Best regards,
Etsuro Fujita
On Tue, May 11, 2021 at 6:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Tue, May 11, 2021 at 6:27 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

On 11/5/21 12:24, Etsuro Fujita wrote:
-> Append (actual rows=3000 loops=1)
-> Async Foreign Scan on f1 (actual rows=0 loops=1)
-> Async Foreign Scan on f2 (actual rows=0 loops=1)
-> Foreign Scan on f3 (actual rows=3000 loops=1)

Here we give preference to the synchronous scan. Why?
This would be expected behavior, and the reason is avoid performance
degradation; you might think it would be better to execute the async
Foreign Scan nodes more aggressively, but it would require
waiting/polling for file descriptor events many times, which is
expensive and might cause performance degradation. I think there is
room for improvement, though.

Yes, I agree with you. Maybe you can add note in documentation on
async_capable, for example:
"... Synchronous and asynchronous scanning strategies can be mixed by
optimizer in one scan plan of a partitioned table or an 'UNION ALL'
command. For performance reasons, synchronous scans executes before the
first of async scan. ..."

+1 But I think this is an independent issue, so I think it would be
better to address the issue separately.
I have committed the patch for the original issue.
Best regards,
Etsuro Fujita
I'm resending this because I failed to reply to all.
On Sat, May 8, 2021 at 12:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, May 7, 2021 at 2:12 AM Stephen Frost <sfrost@snowman.net> wrote:
In order to ensure that the data being returned from a foreign server
is consistent, postgres_fdw will only open one connection for a given
foreign server and will run all queries against that server sequentially
even if there are multiple foreign tables involved. In such a case, it
may be more performant to disable this option to eliminate the overhead
associated with running queries asynchronously.

Ok, I’ll merge this into the next version.
Stephen’s version would be much better than mine, so I updated the
patch as proposed, except for the first sentence. If the foreign
tables are subject to different user mappings, multiple connections
will be opened, and queries will be performed in parallel. So I
expanded the sentence a little bit, to avoid misunderstanding.
Attached is a new version.
Best regards,
Etsuro Fujita
Attachments:
note-about-async-v2.patch (application/octet-stream)
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 839126c4ef..fb87372bde 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -401,6 +401,16 @@ OPTIONS (ADD password_required 'false');
A table-level option overrides a server-level option.
The default is <literal>false</literal>.
</para>
+
+ <para>
+ In order to ensure that the data being returned from a foreign server
+ is consistent, <filename>postgres_fdw</filename> will only open one
+ connection for a given foreign server and will run all queries against
+ that server sequentially even if there are multiple foreign tables
+ involved, unless those tables are subject to different user mappings.
+ In such a case, it may be more performant to disable this option to
+ eliminate the overhead associated with running queries asynchronously.
+ </para>
</listitem>
</varlistentry>
On Sun, May 16, 2021 at 11:39 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Attached is a new version.
I have committed the patch.
Best regards,
Etsuro Fujita
On Wed, Mar 31, 2021 at 6:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Tue, Mar 30, 2021 at 8:40 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I'm happy with the patch, so I'll commit it if there are no objections.
Pushed.
I noticed that rescan of async Appends is broken when
do_exec_prune=false, leading to incorrect results on normal builds and
the following failure on assertion-enabled builds:
TRAP: FailedAssertion("node->as_valid_asyncplans == NULL", File:
"nodeAppend.c", Line: 1126, PID: 76644)
See a test case for this added in the attached. The root cause would
be that we call classify_matching_subplans() to re-determine
sync/async subplans when called from the first ExecAppend() after the
first ReScan, even if do_exec_prune=false, which is incorrect because
in that case it is assumed to re-use sync/async subplans determined
during the first ExecAppend() after Init. The attached fixes this
issue. (A previous patch also had this issue, so I fixed it, but I
think I broke this again when simplifying the patch :-(.) I did a bit
of cleanup, and modified ExecReScanAppend() to initialize an async
state variable as_nasyncresults to zero, to be sure. I think the
variable would have been set to zero before we get to that function,
so I don't think we really need to do so, though.
I will add this to the open items list for v14.
Best regards,
Etsuro Fujita
Attachments:
fix-rescan-of-async-appends.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 7df30010f2..6f8f97c1a9 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9862,6 +9862,48 @@ SELECT * FROM join_tbl ORDER BY a1;
DELETE FROM join_tbl;
RESET enable_partitionwise_join;
+-- Test rescan of an async Append node with do_exec_prune=false
+SET enable_hashjoin TO false;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO join_tbl SELECT * FROM async_p1 t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+ QUERY PLAN
+----------------------------------------------------------------------------------------
+ Insert on public.join_tbl
+ -> Nested Loop
+ Output: t1.a, t1.b, t1.c, t2.a, t2.b, t2.c
+ Join Filter: ((t1.a = t2.a) AND (t1.b = t2.b))
+ -> Foreign Scan on public.async_p1 t1
+ Output: t1.a, t1.b, t1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 100) = 0))
+ -> Append
+ -> Async Foreign Scan on public.async_p1 t2_1
+ Output: t2_1.a, t2_1.b, t2_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 t2_2
+ Output: t2_2.a, t2_2.b, t2_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 t2_3
+ Output: t2_3.a, t2_3.b, t2_3.c
+(16 rows)
+
+INSERT INTO join_tbl SELECT * FROM async_p1 t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+SELECT * FROM join_tbl ORDER BY a1;
+ a1 | b1 | c1 | a2 | b2 | c2
+------+-----+------+------+-----+------
+ 1000 | 0 | 0000 | 1000 | 0 | 0000
+ 1100 | 100 | 0100 | 1100 | 100 | 0100
+ 1200 | 200 | 0200 | 1200 | 200 | 0200
+ 1300 | 300 | 0300 | 1300 | 300 | 0300
+ 1400 | 400 | 0400 | 1400 | 400 | 0400
+ 1500 | 500 | 0500 | 1500 | 500 | 0500
+ 1600 | 600 | 0600 | 1600 | 600 | 0600
+ 1700 | 700 | 0700 | 1700 | 700 | 0700
+ 1800 | 800 | 0800 | 1800 | 800 | 0800
+ 1900 | 900 | 0900 | 1900 | 900 | 0900
+(10 rows)
+
+DELETE FROM join_tbl;
+RESET enable_hashjoin;
-- Test interaction of async execution with plan-time partition pruning
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt WHERE a < 3000;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 78379bdea5..589eb84cf7 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3128,6 +3128,18 @@ DELETE FROM join_tbl;
RESET enable_partitionwise_join;
+-- Test rescan of an async Append node with do_exec_prune=false
+SET enable_hashjoin TO false;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO join_tbl SELECT * FROM async_p1 t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+INSERT INTO join_tbl SELECT * FROM async_p1 t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+
+SELECT * FROM join_tbl ORDER BY a1;
+DELETE FROM join_tbl;
+
+RESET enable_hashjoin;
+
-- Test interaction of async execution with plan-time partition pruning
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt WHERE a < 3000;
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 62335ed4c4..8c72ab9155 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -239,11 +239,6 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
/* Initialize async state */
appendstate->as_asyncplans = asyncplans;
appendstate->as_nasyncplans = nasyncplans;
- appendstate->as_asyncrequests = NULL;
- appendstate->as_asyncresults = (TupleTableSlot **)
- palloc0(nasyncplans * sizeof(TupleTableSlot *));
- appendstate->as_needrequest = NULL;
- appendstate->as_eventset = NULL;
if (nasyncplans > 0)
{
@@ -265,6 +260,14 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->as_asyncrequests[i] = areq;
}
+
+ appendstate->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ appendstate->as_nasyncresults = 0;
+ appendstate->as_needrequest = NULL;
+
+ if (appendstate->as_valid_subplans != NULL)
+ classify_matching_subplans(appendstate);
}
/*
@@ -459,6 +462,7 @@ ExecReScanAppend(AppendState *node)
areq->result = NULL;
}
+ node->as_nasyncresults = 0;
bms_free(node->as_needrequest);
node->as_needrequest = NULL;
}
@@ -861,15 +865,24 @@ ExecAppendAsyncBegin(AppendState *node)
/* Backward scan is not supported by async-aware Appends. */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
+ /* We should never be called when there are no subplans */
+ Assert(node->as_nplans > 0);
+
/* We should never be called when there are no async subplans. */
Assert(node->as_nasyncplans > 0);
/* If we've yet to determine the valid subplans then do so now. */
if (node->as_valid_subplans == NULL)
+ {
node->as_valid_subplans =
ExecFindMatchingSubPlans(node->as_prune_state);
- classify_matching_subplans(node);
+ classify_matching_subplans(node);
+ }
+
+ /* Initialize state variables. */
+ node->as_syncdone = bms_is_empty(node->as_valid_subplans);
+ node->as_nasyncremain = bms_num_members(node->as_valid_asyncplans);
/* Nothing to do if there are no valid async subplans. */
if (node->as_nasyncremain == 0)
@@ -1148,9 +1161,7 @@ classify_matching_subplans(AppendState *node)
/* Adjust the valid subplans to contain sync subplans only. */
node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
valid_asyncplans);
- node->as_syncdone = bms_is_empty(node->as_valid_subplans);
/* Save valid async subplans. */
node->as_valid_asyncplans = valid_asyncplans;
- node->as_nasyncremain = bms_num_members(valid_asyncplans);
}
At Fri, 28 May 2021 16:30:29 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
On Wed, Mar 31, 2021 at 6:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Tue, Mar 30, 2021 at 8:40 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I'm happy with the patch, so I'll commit it if there are no objections.
Pushed.
I noticed that rescan of async Appends is broken when
do_exec_prune=false, leading to incorrect results on normal builds and
the following failure on assertion-enabled builds:
TRAP: FailedAssertion("node->as_valid_asyncplans == NULL", File:
"nodeAppend.c", Line: 1126, PID: 76644)
See a test case for this added in the attached.  The root cause would
be that we call classify_matching_subplans() to re-determine
sync/async subplans when called from the first ExecAppend() after the
first ReScan, even if do_exec_prune=false, which is incorrect because
in that case it is assumed to re-use sync/async subplans determined
during the first ExecAppend() after Init.  The attached fixes this
issue. (A previous patch also had this issue, so I fixed it, but I
think I broke this again when simplifying the patch :-(.) I did a bit
of cleanup, and modified ExecReScanAppend() to initialize an async
state variable as_nasyncresults to zero, to be sure. I think the
variable would have been set to zero before we get to that function,
so I don't think we really need to do so, though.
I will add this to the open items list for v14.
The patch drops some "= NULL" (initial) initializations when
nasyncplans == 0. AFAICS makeNode() fills the returned memory with
zeroes but I'm not sure it is our convention to omit the
initializations.
Otherwise the patch seems to make the code around cleaner.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Horiguchi-san,
On Fri, May 28, 2021 at 5:29 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Fri, 28 May 2021 16:30:29 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
The root cause would
be that we call classify_matching_subplans() to re-determine
sync/async subplans when called from the first ExecAppend() after the
first ReScan, even if do_exec_prune=false, which is incorrect because
in that case it is assumed to re-use sync/async subplans determined
during the first ExecAppend() after Init.
I noticed I wrote it wrong. If do_exec_prune=false, we would
determine sync/async subplans during ExecInitAppend(), so the “re-use
sync/async subplans determined during the first ExecAppend() after
Init" part should be corrected as “re-use sync/async subplans
determined during ExecInitAppend()”. Sorry for that.
The patch drops some "= NULL" (initial) initializations when
nasyncplans == 0. AFAICS makeNode() fills the returned memory with
zeroes but I'm not sure it is our convention to omit the
initializations.
I’m not sure, but I think we omit it in some cases; for example, we
don’t set as_valid_subplans to NULL explicitly in ExecInitAppend(), if
do_exec_prune=true.
Otherwise the patch seems to make the code around cleaner.
Thanks for reviewing!
Best regards,
Etsuro Fujita
On Fri, May 28, 2021 at 10:53 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, May 28, 2021 at 5:29 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
The patch drops some "= NULL" (initial) initializations when
nasyncplans == 0. AFAICS makeNode() fills the returned memory with
zeroes but I'm not sure it is our convention to omit the
initializations.
I’m not sure, but I think we omit it in some cases; for example, we
don’t set as_valid_subplans to NULL explicitly in ExecInitAppend(), if
do_exec_prune=true.
Ok, I think it would be a good thing to initialize the
pointers/variables to NULL/zero explicitly, so I updated the patch as
such. Barring objections, I'll get the patch committed in a few days.
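As a side note, the zero-filled allocation behaviour being discussed can be illustrated with a tiny stand-in (ToyAppendState and make_node are hypothetical; calloc plays the role of the palloc0-based makeNode() here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for AppendState; only two fields for brevity. */
typedef struct ToyAppendState
{
    void *as_asyncresults;
    int   as_nasyncresults;
} ToyAppendState;

/* calloc zero-fills the node, roughly as makeNode() does, so
 * pointer/counter fields start out NULL/0 without assignments. */
static ToyAppendState *
make_node(void)
{
    return (ToyAppendState *) calloc(1, sizeof(ToyAppendState));
}
```

So the explicit "= NULL / = 0" assignments are redundant in practice, but they document intent, which is what the updated patch opts for.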
Best regards,
Etsuro Fujita
Attachments:
fix-rescan-of-async-appends-v2.patchapplication/octet-stream; name=fix-rescan-of-async-appends-v2.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 7df30010f2..6f8f97c1a9 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9862,6 +9862,48 @@ SELECT * FROM join_tbl ORDER BY a1;
DELETE FROM join_tbl;
RESET enable_partitionwise_join;
+-- Test rescan of an async Append node with do_exec_prune=false
+SET enable_hashjoin TO false;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO join_tbl SELECT * FROM async_p1 t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+ QUERY PLAN
+----------------------------------------------------------------------------------------
+ Insert on public.join_tbl
+ -> Nested Loop
+ Output: t1.a, t1.b, t1.c, t2.a, t2.b, t2.c
+ Join Filter: ((t1.a = t2.a) AND (t1.b = t2.b))
+ -> Foreign Scan on public.async_p1 t1
+ Output: t1.a, t1.b, t1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 100) = 0))
+ -> Append
+ -> Async Foreign Scan on public.async_p1 t2_1
+ Output: t2_1.a, t2_1.b, t2_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 t2_2
+ Output: t2_2.a, t2_2.b, t2_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Seq Scan on public.async_p3 t2_3
+ Output: t2_3.a, t2_3.b, t2_3.c
+(16 rows)
+
+INSERT INTO join_tbl SELECT * FROM async_p1 t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+SELECT * FROM join_tbl ORDER BY a1;
+ a1 | b1 | c1 | a2 | b2 | c2
+------+-----+------+------+-----+------
+ 1000 | 0 | 0000 | 1000 | 0 | 0000
+ 1100 | 100 | 0100 | 1100 | 100 | 0100
+ 1200 | 200 | 0200 | 1200 | 200 | 0200
+ 1300 | 300 | 0300 | 1300 | 300 | 0300
+ 1400 | 400 | 0400 | 1400 | 400 | 0400
+ 1500 | 500 | 0500 | 1500 | 500 | 0500
+ 1600 | 600 | 0600 | 1600 | 600 | 0600
+ 1700 | 700 | 0700 | 1700 | 700 | 0700
+ 1800 | 800 | 0800 | 1800 | 800 | 0800
+ 1900 | 900 | 0900 | 1900 | 900 | 0900
+(10 rows)
+
+DELETE FROM join_tbl;
+RESET enable_hashjoin;
-- Test interaction of async execution with plan-time partition pruning
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt WHERE a < 3000;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 78379bdea5..589eb84cf7 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3128,6 +3128,18 @@ DELETE FROM join_tbl;
RESET enable_partitionwise_join;
+-- Test rescan of an async Append node with do_exec_prune=false
+SET enable_hashjoin TO false;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO join_tbl SELECT * FROM async_p1 t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+INSERT INTO join_tbl SELECT * FROM async_p1 t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+
+SELECT * FROM join_tbl ORDER BY a1;
+DELETE FROM join_tbl;
+
+RESET enable_hashjoin;
+
-- Test interaction of async execution with plan-time partition pruning
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt WHERE a < 3000;
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 62335ed4c4..755c1392f0 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -240,10 +240,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->as_asyncplans = asyncplans;
appendstate->as_nasyncplans = nasyncplans;
appendstate->as_asyncrequests = NULL;
- appendstate->as_asyncresults = (TupleTableSlot **)
- palloc0(nasyncplans * sizeof(TupleTableSlot *));
+ appendstate->as_asyncresults = NULL;
+ appendstate->as_nasyncresults = 0;
+ appendstate->as_nasyncremain = 0;
appendstate->as_needrequest = NULL;
appendstate->as_eventset = NULL;
+ appendstate->as_valid_asyncplans = NULL;
if (nasyncplans > 0)
{
@@ -265,6 +267,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->as_asyncrequests[i] = areq;
}
+
+ appendstate->as_asyncresults = (TupleTableSlot **)
+ palloc0(nasyncplans * sizeof(TupleTableSlot *));
+
+ if (appendstate->as_valid_subplans != NULL)
+ classify_matching_subplans(appendstate);
}
/*
@@ -459,6 +467,8 @@ ExecReScanAppend(AppendState *node)
areq->result = NULL;
}
+ node->as_nasyncresults = 0;
+ node->as_nasyncremain = 0;
bms_free(node->as_needrequest);
node->as_needrequest = NULL;
}
@@ -861,15 +871,24 @@ ExecAppendAsyncBegin(AppendState *node)
/* Backward scan is not supported by async-aware Appends. */
Assert(ScanDirectionIsForward(node->ps.state->es_direction));
+ /* We should never be called when there are no subplans */
+ Assert(node->as_nplans > 0);
+
/* We should never be called when there are no async subplans. */
Assert(node->as_nasyncplans > 0);
/* If we've yet to determine the valid subplans then do so now. */
if (node->as_valid_subplans == NULL)
+ {
node->as_valid_subplans =
ExecFindMatchingSubPlans(node->as_prune_state);
- classify_matching_subplans(node);
+ classify_matching_subplans(node);
+ }
+
+ /* Initialize state variables. */
+ node->as_syncdone = bms_is_empty(node->as_valid_subplans);
+ node->as_nasyncremain = bms_num_members(node->as_valid_asyncplans);
/* Nothing to do if there are no valid async subplans. */
if (node->as_nasyncremain == 0)
@@ -1148,9 +1167,7 @@ classify_matching_subplans(AppendState *node)
/* Adjust the valid subplans to contain sync subplans only. */
node->as_valid_subplans = bms_del_members(node->as_valid_subplans,
valid_asyncplans);
- node->as_syncdone = bms_is_empty(node->as_valid_subplans);
/* Save valid async subplans. */
node->as_valid_asyncplans = valid_asyncplans;
- node->as_nasyncremain = bms_num_members(valid_asyncplans);
}
On Tue, May 11, 2021 at 6:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Tue, May 11, 2021 at 6:27 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
On 11/5/21 12:24, Etsuro Fujita wrote:
-> Append (actual rows=3000 loops=1)
-> Async Foreign Scan on f1 (actual rows=0 loops=1)
-> Async Foreign Scan on f2 (actual rows=0 loops=1)
-> Foreign Scan on f3 (actual rows=3000 loops=1)
Here we give preference to the synchronous scan. Why?
This would be expected behavior, and the reason is to avoid performance
degradation; you might think it would be better to execute the async
Foreign Scan nodes more aggressively, but it would require
waiting/polling for file descriptor events many times, which is
expensive and might cause performance degradation. I think there is
room for improvement, though.
Yes, I agree with you. Maybe you can add a note in the documentation on
async_capable, for example:
"... Synchronous and asynchronous scanning strategies can be mixed by
optimizer in one scan plan of a partitioned table or an 'UNION ALL'
command. For performance reasons, synchronous scans execute before the
first async scan. ..."
+1 But I think this is an independent issue, so I think it would be
better to address the issue separately.
I think that since postgres-fdw.sgml would be for users rather than
developers, unlike fdwhandler.sgml, it would be better to explain this
more in a not-too-technical way. So how about something like this?
Asynchronous execution is applied even when an Append node contains
subplan(s) executed synchronously as well as subplan(s) executed
asynchronously. In that case, if the asynchronous subplans are ones
executed using postgres_fdw, tuples from the asynchronous subplans are
not returned until after at least one synchronous subplan returns all
tuples, as that subplan is executed while the asynchronous subplans
are waiting for the results of queries sent to foreign servers. This
behavior might change in a future release.
Best regards,
Etsuro Fujita
On 3/6/21 14:49, Etsuro Fujita wrote:
On Tue, May 11, 2021 at 6:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Tue, May 11, 2021 at 6:27 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
On 11/5/21 12:24, Etsuro Fujita wrote:
-> Append (actual rows=3000 loops=1)
-> Async Foreign Scan on f1 (actual rows=0 loops=1)
-> Async Foreign Scan on f2 (actual rows=0 loops=1)
-> Foreign Scan on f3 (actual rows=3000 loops=1)
Here we give preference to the synchronous scan. Why?
This would be expected behavior, and the reason is to avoid performance
degradation; you might think it would be better to execute the async
Foreign Scan nodes more aggressively, but it would require
waiting/polling for file descriptor events many times, which is
expensive and might cause performance degradation. I think there is
room for improvement, though.
Yes, I agree with you. Maybe you can add a note in the documentation on
async_capable, for example:
"... Synchronous and asynchronous scanning strategies can be mixed by
optimizer in one scan plan of a partitioned table or an 'UNION ALL'
command. For performance reasons, synchronous scans execute before the
first async scan. ..."
+1 But I think this is an independent issue, so I think it would be
better to address the issue separately.
I think that since postgres-fdw.sgml would be for users rather than
developers, unlike fdwhandler.sgml, it would be better to explain this
more in a not-too-technical way. So how about something like this?
Asynchronous execution is applied even when an Append node contains
subplan(s) executed synchronously as well as subplan(s) executed
asynchronously. In that case, if the asynchronous subplans are ones
executed using postgres_fdw, tuples from the asynchronous subplans are
not returned until after at least one synchronous subplan returns all
tuples, as that subplan is executed while the asynchronous subplans
are waiting for the results of queries sent to foreign servers. This
behavior might change in a future release.
Good, this text is clear for me.
--
regards,
Andrey Lepikhov
Postgres Professional
On Tue, Jun 1, 2021 at 6:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, May 28, 2021 at 10:53 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, May 28, 2021 at 5:29 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
The patch drops some "= NULL" (initial) initializations when
nasyncplans == 0. AFAICS makeNode() fills the returned memory with
zeroes but I'm not sure it is our convention to omit the
initializations.
I’m not sure, but I think we omit it in some cases; for example, we
don’t set as_valid_subplans to NULL explicitly in ExecInitAppend(), if
do_exec_prune=true.
Ok, I think it would be a good thing to initialize the
pointers/variables to NULL/zero explicitly, so I updated the patch as
such. Barring objections, I'll get the patch committed in a few days.
I'm replanning to push this early next week for some reason.
Best regards,
Etsuro Fujita
On Fri, Jun 4, 2021 at 7:26 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Tue, Jun 1, 2021 at 6:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, May 28, 2021 at 10:53 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, May 28, 2021 at 5:29 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
The patch drops some "= NULL" (initial) initializations when
nasyncplans == 0. AFAICS makeNode() fills the returned memory with
zeroes but I'm not sure it is our convention to omit the
initializations.
I’m not sure, but I think we omit it in some cases; for example, we
don’t set as_valid_subplans to NULL explicitly in ExecInitAppend(), if
do_exec_prune=true.
Ok, I think it would be a good thing to initialize the
pointers/variables to NULL/zero explicitly, so I updated the patch as
such. Barring objections, I'll get the patch committed in a few days.
I'm replanning to push this early next week for some reason.
Pushed. I will close this in the open items list for v14.
Best regards,
Etsuro Fujita
On Fri, Jun 4, 2021 at 12:33 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
Good, this text is clear for me.
Cool! I created a patch for that, which I'm attaching. I'm planning
to commit the patch.
Thanks for reviewing!
Best regards,
Etsuro Fujita
Attachments:
note-about-sync-vs-async.patchapplication/octet-stream; name=note-about-sync-vs-async.patchDownload
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 65171841c9..1a250f0287 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -411,6 +411,18 @@ OPTIONS (ADD password_required 'false');
In such a case, it may be more performant to disable this option to
eliminate the overhead associated with running queries asynchronously.
</para>
+
+ <para>
+ Asynchronous execution is applied even when an
+ <structname>Append</structname> node contains subplan(s) executed
+ synchronously as well as subplan(s) executed asynchronously. In such
+ a case, if the asynchronous subplans are ones processed using
+ <filename>postgres_fdw</filename>, tuples from the asynchronous
+ subplans are not returned until after at least one synchronous subplan
+ returns all tuples, as that subplan is executed while the asynchronous
+ subplans are waiting for the results of asynchronous queries sent to
+ foreign servers. This behavior might change in a future release.
+ </para>
</listitem>
</varlistentry>
On Mon, Jun 7, 2021 at 6:36 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I created a patch for that, which I'm attaching. I'm planning
to commit the patch.
Done.
Best regards,
Etsuro Fujita
On 11/5/21 06:55, Zhihong Yu wrote:
On Mon, May 10, 2021 at 8:45 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
It seems the if statement is not needed: you can directly assign false
to subplan->async_capable.
I have completely rewritten this patch.
Main idea:
The async_capable field of a plan node informs us that this node could
work in async mode. Each node sets this field based on its own logic.
The actual mode of a node is defined by the async_capable field of the
PlanState structure, which is set at the executor initialization stage.
In this patch, only an Append node can define async behaviour for its
subplans.
With such an approach, the IsForeignPathAsyncCapable routine becomes
unnecessary, I think.
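The capability-vs-mode split described here might be sketched as follows (ToyPlan/ToyPlanState are hypothetical toys; the real Plan and PlanState structs carry far more state): the plan-level flag records what the planner allows, and the parent decides the actual mode at executor-init time.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy versions of the two flags. */
typedef struct ToyPlan
{
    bool async_capable;      /* planner: node *could* run async */
} ToyPlan;

typedef struct ToyPlanState
{
    const ToyPlan *plan;
    bool async_capable;      /* executor: node *will* run async */
} ToyPlanState;

/* At executor-init time only the parent (an Append, in this patch)
 * turns the mode on, and only when it considers async worthwhile. */
static void
init_subplan(ToyPlanState *ps, const ToyPlan *plan, bool consider_async)
{
    ps->plan = plan;
    ps->async_capable = consider_async && plan->async_capable;
}
```

With this shape, a capable subplan still runs synchronously whenever the parent declines async execution, e.g. in a parallel-safe or single-subplan Append.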
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
0001-Choose-async-append-subplans-at-the-initial-executio.patchtext/plain; charset=UTF-8; name=0001-Choose-async-append-subplans-at-the-initial-executio.patch; x-mac-creator=0; x-mac-type=0Download
From d935bbb70565d70f1b0f547bf37e71ffc6fa2ef2 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru>
Date: Tue, 29 Jun 2021 22:09:54 +0300
Subject: [PATCH] Choose async append subplans at the initial execution stage
---
contrib/file_fdw/file_fdw.c | 3 +-
.../postgres_fdw/expected/postgres_fdw.out | 81 ++++++++++++++++++-
contrib/postgres_fdw/postgres_fdw.c | 13 +--
contrib/postgres_fdw/sql/postgres_fdw.sql | 29 +++++++
src/backend/commands/explain.c | 2 +-
src/backend/executor/execAmi.c | 4 -
src/backend/executor/nodeAppend.c | 27 ++++---
src/backend/executor/nodeForeignscan.c | 7 --
src/backend/nodes/copyfuncs.c | 1 -
src/backend/nodes/outfuncs.c | 1 -
src/backend/nodes/readfuncs.c | 1 -
src/backend/optimizer/path/costsize.c | 1 -
src/backend/optimizer/plan/createplan.c | 45 +----------
src/backend/utils/misc/guc.c | 1 +
src/include/executor/nodeAppend.h | 2 +
src/include/nodes/plannodes.h | 1 -
src/include/optimizer/cost.h | 1 -
src/include/optimizer/planmain.h | 2 +-
18 files changed, 141 insertions(+), 81 deletions(-)
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 2c2f149fb0..5f67e1ca94 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -609,7 +609,8 @@ fileGetForeignPlan(PlannerInfo *root,
best_path->fdw_private,
NIL, /* no custom tlist */
NIL, /* no remote quals */
- outer_plan);
+ outer_plan,
+ false);
}
/*
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 31b5de91ad..30c38c6992 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10169,7 +10169,7 @@ EXECUTE async_pt_query (2000, 505);
Insert on public.result_tbl
-> Append
Subplans Removed: 2
- -> Async Foreign Scan on public.async_p1 async_pt_1
+ -> Foreign Scan on public.async_p1 async_pt_1
Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
Filter: (async_pt_1.b === $2)
Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < $1::integer))
@@ -10237,6 +10237,85 @@ SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c
2505 | 505 | bar | 2505 | 505 | 0505
(1 row)
+-- Subquery flattening must be done before choosing of async plans.
+EXPLAIN (VERBOSE, COSTS OFF)
+(SELECT * FROM async_p1 LIMIT 1)
+ UNION ALL
+(SELECT * FROM async_p2 WHERE a < 5)
+ UNION ALL
+(SELECT * FROM async_p2)
+ UNION ALL
+(SELECT * FROM async_p3 LIMIT 3);
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Append
+ -> Async Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 LIMIT 1::bigint
+ -> Async Foreign Scan on public.async_p2 async_p2_1
+ Output: async_p2_1.a, async_p2_1.b, async_p2_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((a < 5))
+ -> Async Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, async_p2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Limit
+ Output: async_p3.a, async_p3.b, async_p3.c
+ -> Seq Scan on public.async_p3
+ Output: async_p3.a, async_p3.b, async_p3.c
+(14 rows)
+
+-- Check that async append doesn't break the scrollable cursors logic:
+-- If the query plan doesn't support backward scan, a materialize node will be
+-- inserted in the head of this plan.
+BEGIN;
+EXPLAIN (COSTS OFF)
+DECLARE curs1 SCROLL CURSOR FOR (SELECT * FROM async_p3);
+ QUERY PLAN
+----------------------
+ Seq Scan on async_p3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+DECLARE curs1 SCROLL CURSOR FOR (SELECT * FROM async_pt);
+ QUERY PLAN
+-------------------------------------------------------
+ Materialize
+ -> Append
+ -> Async Foreign Scan on async_p1 async_pt_1
+ -> Async Foreign Scan on async_p2 async_pt_2
+ -> Seq Scan on async_p3 async_pt_3
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+DECLARE curs1 NO SCROLL CURSOR FOR (SELECT * FROM async_p1);
+ QUERY PLAN
+--------------------------
+ Foreign Scan on async_p1
+(1 row)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+DECLARE curs2 SCROLL CURSOR FOR
+ (SELECT * FROM async_p1)
+ UNION ALL
+ (SELECT * FROM async_p2 WHERE a < 5)
+ UNION ALL
+ (SELECT * FROM async_p3);
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Materialize
+ Output: async_p1.a, async_p1.b, async_p1.c
+ -> Append
+ -> Async Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, async_p2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((a < 5))
+ -> Seq Scan on public.async_p3
+ Output: async_p3.a, async_p3.b, async_p3.c
+(11 rows)
+
+ROLLBACK;
ALTER FOREIGN TABLE async_p1 OPTIONS (DROP use_remote_estimate);
ALTER FOREIGN TABLE async_p2 OPTIONS (DROP use_remote_estimate);
DROP TABLE local_tbl;
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fafbab6b02..6a52859f5e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -163,9 +163,6 @@ typedef struct PgFdwScanState
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
- /* for asynchronous execution */
- bool async_capable; /* engage asynchronous-capable logic? */
-
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
MemoryContext temp_cxt; /* context for per-tuple temporary data */
@@ -1436,7 +1433,8 @@ postgresGetForeignPlan(PlannerInfo *root,
fdw_private,
fdw_scan_tlist,
fdw_recheck_quals,
- outer_plan);
+ outer_plan,
+ fpinfo->async_capable);
}
/*
@@ -1591,9 +1589,6 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
&fsstate->param_flinfo,
&fsstate->param_exprs,
&fsstate->param_values);
-
- /* Set the async-capable flag */
- fsstate->async_capable = node->ss.ps.async_capable;
}
/*
@@ -1622,7 +1617,7 @@ postgresIterateForeignScan(ForeignScanState *node)
if (fsstate->next_tuple >= fsstate->num_tuples)
{
/* In async mode, just clear tuple slot. */
- if (fsstate->async_capable)
+ if (node->ss.ps.async_capable)
return ExecClearTuple(slot);
/* No point in another fetch if we already detected EOF, though. */
if (!fsstate->eof_reached)
@@ -3781,7 +3776,7 @@ fetch_more_data(ForeignScanState *node)
int numrows;
int i;
- if (fsstate->async_capable)
+ if (node->ss.ps.async_capable)
{
Assert(fsstate->conn_state->pendingAreq);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 286dd99573..1c1c152bcf 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3252,6 +3252,35 @@ EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
+-- Subquery flattening must be done before choosing of async plans.
+EXPLAIN (VERBOSE, COSTS OFF)
+(SELECT * FROM async_p1 LIMIT 1)
+ UNION ALL
+(SELECT * FROM async_p2 WHERE a < 5)
+ UNION ALL
+(SELECT * FROM async_p2)
+ UNION ALL
+(SELECT * FROM async_p3 LIMIT 3);
+
+-- Check that async append doesn't break the scrollable cursors logic:
+-- If the query plan doesn't support backward scan, a materialize node will be
+-- inserted in the head of this plan.
+BEGIN;
+EXPLAIN (COSTS OFF)
+DECLARE curs1 SCROLL CURSOR FOR (SELECT * FROM async_p3);
+EXPLAIN (COSTS OFF)
+DECLARE curs1 SCROLL CURSOR FOR (SELECT * FROM async_pt);
+EXPLAIN (COSTS OFF)
+DECLARE curs1 NO SCROLL CURSOR FOR (SELECT * FROM async_p1);
+EXPLAIN (VERBOSE, COSTS OFF)
+DECLARE curs2 SCROLL CURSOR FOR
+ (SELECT * FROM async_p1)
+ UNION ALL
+ (SELECT * FROM async_p2 WHERE a < 5)
+ UNION ALL
+ (SELECT * FROM async_p3);
+ROLLBACK;
+
ALTER FOREIGN TABLE async_p1 OPTIONS (DROP use_remote_estimate);
ALTER FOREIGN TABLE async_p2 OPTIONS (DROP use_remote_estimate);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e81b990092..6c7f8e9d9f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1411,7 +1411,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
}
if (plan->parallel_aware)
appendStringInfoString(es->str, "Parallel ");
- if (plan->async_capable)
+ if (planstate->async_capable)
appendStringInfoString(es->str, "Async ");
appendStringInfoString(es->str, pname);
es->indent++;
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 10f0b349b5..ddeb028cf1 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -537,10 +537,6 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
- /* With async, tuples may be interleaved, so can't back up. */
- if (((Append *) node)->nasyncplans > 0)
- return false;
-
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 755c1392f0..0f2148e097 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -83,6 +83,8 @@ struct ParallelAppendState
#define INVALID_SUBPLAN_INDEX -1
#define EVENT_BUFFER_SIZE 16
+bool enable_async_append = true;
+
static TupleTableSlot *ExecAppend(PlanState *pstate);
static bool choose_next_subplan_locally(AppendState *node);
static bool choose_next_subplan_for_leader(AppendState *node);
@@ -117,6 +119,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
int firstvalid;
int i,
j;
+ bool consider_async;
/* check for unsupported flags */
Assert(!(eflags & EXEC_FLAG_MARK));
@@ -197,6 +200,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendplanstates = (PlanState **) palloc(nplans *
sizeof(PlanState *));
+ consider_async = (enable_async_append && !node->plan.parallel_safe &&
+ bms_num_members(validsubplans) > 1);
/*
* call ExecInitNode on each of the valid plans to be executed and save
* the results into the appendplanstates array.
@@ -212,24 +217,28 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
{
Plan *initNode = (Plan *) list_nth(node->appendplans, i);
+ /*
+ * Record the lowest appendplans index which is a valid partial plan.
+ */
+ if (i >= node->first_partial_plan && j < firstvalid)
+ firstvalid = j;
+
+ appendplanstates[j] = ExecInitNode(initNode, estate, eflags);
+
/*
* Record async subplans. When executing EvalPlanQual, we treat them
* as sync ones; don't do this when initializing an EvalPlanQual plan
* tree.
*/
- if (initNode->async_capable && estate->es_epq_active == NULL)
+ if (consider_async && initNode->async_capable &&
+ estate->es_epq_active == NULL)
{
asyncplans = bms_add_member(asyncplans, j);
nasyncplans++;
+ appendplanstates[j++]->async_capable = true;
}
-
- /*
- * Record the lowest appendplans index which is a valid partial plan.
- */
- if (i >= node->first_partial_plan && j < firstvalid)
- firstvalid = j;
-
- appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+ else
+ appendplanstates[j++]->async_capable = false;
}
appendstate->as_first_partial_plan = firstvalid;
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 9dc38d47ea..898890fb08 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -209,13 +209,6 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
scanstate->fdw_recheck_quals =
ExecInitQual(node->fdw_recheck_quals, (PlanState *) scanstate);
- /*
- * Determine whether to scan the foreign relation asynchronously or not;
- * this has to be kept in sync with the code in ExecInitAppend().
- */
- scanstate->ss.ps.async_capable = (((Plan *) node)->async_capable &&
- estate->es_epq_active == NULL);
-
/*
* Initialize FDW-related state.
*/
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index bd87f23784..aca4a7cce4 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -243,7 +243,6 @@ _copyAppend(const Append *from)
*/
COPY_BITMAPSET_FIELD(apprelids);
COPY_NODE_FIELD(appendplans);
- COPY_SCALAR_FIELD(nasyncplans);
COPY_SCALAR_FIELD(first_partial_plan);
COPY_NODE_FIELD(part_prune_info);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e32b92e299..8e72c1333f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -433,7 +433,6 @@ _outAppend(StringInfo str, const Append *node)
WRITE_BITMAPSET_FIELD(apprelids);
WRITE_NODE_FIELD(appendplans);
- WRITE_INT_FIELD(nasyncplans);
WRITE_INT_FIELD(first_partial_plan);
WRITE_NODE_FIELD(part_prune_info);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index f0b34ecfac..a2aafcd2ce 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1717,7 +1717,6 @@ _readAppend(void)
READ_BITMAPSET_FIELD(apprelids);
READ_NODE_FIELD(appendplans);
- READ_INT_FIELD(nasyncplans);
READ_INT_FIELD(first_partial_plan);
READ_NODE_FIELD(part_prune_info);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8577c7b138..21e0dd0049 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -149,7 +149,6 @@ bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
bool enable_partition_pruning = true;
-bool enable_async_append = true;
typedef struct
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 439e6b6426..f2baa58269 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -82,7 +82,6 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
-static bool is_async_capable_path(Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1093,31 +1092,6 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
return plan;
}
-/*
- * is_async_capable_path
- * Check whether a given Path node is async-capable.
- */
-static bool
-is_async_capable_path(Path *path)
-{
- switch (nodeTag(path))
- {
- case T_ForeignPath:
- {
- FdwRoutine *fdwroutine = path->parent->fdwroutine;
-
- Assert(fdwroutine != NULL);
- if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
- fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
- return true;
- }
- break;
- default:
- break;
- }
- return false;
-}
-
/*
* create_append_plan
* Create an Append plan for 'best_path' and (recursively) plans
@@ -1135,7 +1109,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
List *pathkeys = best_path->path.pathkeys;
List *subplans = NIL;
ListCell *subpaths;
- int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
@@ -1143,7 +1116,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *nodeSortOperators = NULL;
Oid *nodeCollations = NULL;
bool *nodeNullsFirst = NULL;
- bool consider_async = false;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1207,11 +1179,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
}
- /* If appropriate, consider async append */
- consider_async = (enable_async_append && pathkeys == NIL &&
- !best_path->path.parallel_safe &&
- list_length(best_path->subpaths) > 1);
-
/* Build the plan for each child */
foreach(subpaths, best_path->subpaths)
{
@@ -1279,13 +1246,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
subplans = lappend(subplans, subplan);
-
- /* Check to see if subplan can be executed asynchronously */
- if (consider_async && is_async_capable_path(subpath))
- {
- subplan->async_capable = true;
- ++nasyncplans;
- }
}
/*
@@ -1318,7 +1278,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
plan->appendplans = subplans;
- plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
plan->part_prune_info = partpruneinfo;
@@ -5685,7 +5644,8 @@ make_foreignscan(List *qptlist,
List *fdw_private,
List *fdw_scan_tlist,
List *fdw_recheck_quals,
- Plan *outer_plan)
+ Plan *outer_plan,
+ bool async_capable)
{
ForeignScan *node = makeNode(ForeignScan);
Plan *plan = &node->scan.plan;
@@ -5695,6 +5655,7 @@ make_foreignscan(List *qptlist,
plan->qual = qpqual;
plan->lefttree = outer_plan;
plan->righttree = NULL;
+ plan->async_capable = async_capable; /* set support of async opts */
node->scan.scanrelid = scanrelid;
/* these may be overridden by the FDW's PlanDirectModify callback. */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 480e8cd199..b14fe74050 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -51,6 +51,7 @@
#include "commands/vacuum.h"
#include "commands/variable.h"
#include "common/string.h"
+#include "executor/nodeAppend.h"
#include "funcapi.h"
#include "jit/jit.h"
#include "libpq/auth.h"
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index fa54ac6ad2..1831d6e021 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -17,6 +17,8 @@
#include "access/parallel.h"
#include "nodes/execnodes.h"
+extern PGDLLIMPORT bool enable_async_append;
+
extern AppendState *ExecInitAppend(Append *node, EState *estate, int eflags);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index aaa3b65d04..4d7595a7b1 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -251,7 +251,6 @@ typedef struct Append
Plan plan;
Bitmapset *apprelids; /* RTIs of appendrel(s) formed by this node */
List *appendplans;
- int nasyncplans; /* # of asynchronous plans */
/*
* All 'appendplans' preceding this index are non-partial plans. All
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 0fe60d82e4..67f925e793 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -66,7 +66,6 @@ extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
-extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index bf1adfc52a..710503e501 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -42,7 +42,7 @@ extern Plan *create_plan(PlannerInfo *root, Path *best_path);
extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
Index scanrelid, List *fdw_exprs, List *fdw_private,
List *fdw_scan_tlist, List *fdw_recheck_quals,
- Plan *outer_plan);
+ Plan *outer_plan, bool async_capable);
extern Plan *change_plan_targetlist(Plan *subplan, List *tlist,
bool tlist_parallel_safe);
extern Plan *materialize_finished_plan(Plan *subplan);
--
2.31.1
On Wed, Jun 30, 2021 at 1:50 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
I have completely rewritten this patch.
Main idea:
The async_capable field of a plan node informs us that this node can
work in async mode. Each node sets this field based on its own logic.
The actual mode of a node is determined by the async_capable field of
its PlanState structure, which is set at the executor initialization
stage.
In this patch, only an Append node can enable async behaviour for its
subplans.
I finally reviewed the patch. One thing I noticed is that it would
break ordered Appends. Here is an example using the patch:
create table pt (a int) partition by range (a);
create table loct1 (a int);
create table loct2 (a int);
create foreign table p1 partition of pt for values from (10) to (20)
server loopback1 options (table_name 'loct1');
create foreign table p2 partition of pt for values from (20) to (30)
server loopback2 options (table_name 'loct2');
explain verbose select * from pt order by a;
QUERY PLAN
-------------------------------------------------------------------------------------
Append (cost=200.00..440.45 rows=5850 width=4)
-> Async Foreign Scan on public.p1 pt_1 (cost=100.00..205.60
rows=2925 width=4)
Output: pt_1.a
Remote SQL: SELECT a FROM public.loct1 ORDER BY a ASC NULLS LAST
-> Async Foreign Scan on public.p2 pt_2 (cost=100.00..205.60
rows=2925 width=4)
Output: pt_2.a
Remote SQL: SELECT a FROM public.loct2 ORDER BY a ASC NULLS LAST
(7 rows)
This would not always provide tuples in the required order, as async
execution would return them from the subplans rather randomly. I
think it would be not only too late but also inefficient to do the
planning work at execution time (consider executing generic plans!),
so I think we should avoid doing so. (The cost of doing that work for
simple foreign scans is small, but if we support async execution for
upper plan nodes such as NestLoop, as discussed before, I think the
cost for such plan nodes would not be small anymore.)
To just execute what was planned at execution time, I think we should
return to the patch in [1]. The patch was created for Horiguchi-san’s
async-execution patch, so I modified it to work with HEAD, and added a
simplified version of your test cases. Please find attached a patch.
Best regards,
Etsuro Fujita
[1]: /messages/by-id/7fe10f95-ac6c-c81d-a9d3-227493eb9055@postgrespro.ru
Attachments:
allow-async-in-more-cases.patchapplication/octet-stream; name=allow-async-in-more-cases.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index e3ee30f1aa..402882f0d2 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10456,6 +10456,88 @@ DROP TABLE local_tbl;
DROP INDEX base_tbl1_idx;
DROP INDEX base_tbl2_idx;
DROP INDEX async_p3_idx;
+-- UNION queries
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT * FROM async_p2 WHERE b < 10);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> HashAggregate
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Group Key: async_p1.a, async_p1.b, async_p1.c
+ -> Append
+ -> Async Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 ORDER BY a ASC NULLS LAST LIMIT 10::bigint
+ -> Async Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, async_p2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((b < 10))
+(11 rows)
+
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT * FROM async_p2 WHERE b < 10);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+----+------
+ 1000 | 0 | 0000
+ 1005 | 5 | 0005
+ 1010 | 10 | 0010
+ 1015 | 15 | 0015
+ 1020 | 20 | 0020
+ 1025 | 25 | 0025
+ 1030 | 30 | 0030
+ 1035 | 35 | 0035
+ 1040 | 40 | 0040
+ 1045 | 45 | 0045
+ 2000 | 0 | 0000
+ 2005 | 5 | 0005
+(12 rows)
+
+DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT * FROM async_p2 WHERE b < 10);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 ORDER BY a ASC NULLS LAST LIMIT 10::bigint
+ -> Async Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, async_p2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((b < 10))
+(8 rows)
+
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT * FROM async_p2 WHERE b < 10);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+----+------
+ 1000 | 0 | 0000
+ 1005 | 5 | 0005
+ 1010 | 10 | 0010
+ 1015 | 15 | 0015
+ 1020 | 20 | 0020
+ 1025 | 25 | 0025
+ 1030 | 30 | 0030
+ 1035 | 35 | 0035
+ 1040 | 40 | 0040
+ 1045 | 45 | 0045
+ 2000 | 0 | 0000
+ 2005 | 5 | 0005
+(12 rows)
+
+DELETE FROM result_tbl;
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 30b5175da5..22848710c5 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3322,6 +3322,33 @@ DROP INDEX base_tbl1_idx;
DROP INDEX base_tbl2_idx;
DROP INDEX async_p3_idx;
+-- UNION queries
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT * FROM async_p2 WHERE b < 10);
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT * FROM async_p2 WHERE b < 10);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT * FROM async_p2 WHERE b < 10);
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT * FROM async_p2 WHERE b < 10);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 38251c2b8e..5ef1888407 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -626,6 +626,7 @@ _copySubqueryScan(const SubqueryScan *from)
* copy remainder of node
*/
COPY_NODE_FIELD(subplan);
+ COPY_SCALAR_FIELD(status);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 87561cbb6f..0d41a9ab4f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -629,6 +629,7 @@ _outSubqueryScan(StringInfo str, const SubqueryScan *node)
_outScanInfo(str, (const Scan *) node);
WRITE_NODE_FIELD(subplan);
+ WRITE_ENUM_FIELD(status, SubqueryScanStatus);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 0dd1ad7dfc..ebee3b6acf 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1967,6 +1967,7 @@ _readSubqueryScan(void)
ReadCommonScan(&local_node->scan);
READ_NODE_FIELD(subplan);
+ READ_ENUM_FIELD(status, SubqueryScanStatus);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index a5f6d678cc..9b6a8cfb74 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -82,7 +82,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
-static bool is_async_capable_path(Path *path);
+static bool mark_async_capable_plan(Plan *plan, Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1091,14 +1091,25 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
}
/*
- * is_async_capable_path
- * Check whether a given Path node is async-capable.
+ * mark_async_capable_plan
+ * Check whether a given Path node is async-capable, and if so, mark the
+ * Plan node created from it as such.
*/
static bool
-is_async_capable_path(Path *path)
+mark_async_capable_plan(Plan *plan, Path *path)
{
switch (nodeTag(path))
{
+ case T_SubqueryScanPath:
+ {
+ SubqueryScan *splan = (SubqueryScan *) plan;
+
+ if (trivial_subqueryscan(splan) &&
+ mark_async_capable_plan(splan->subplan,
+ ((SubqueryScanPath *) path)->subpath))
+ break;
+ return false;
+ }
case T_ForeignPath:
{
FdwRoutine *fdwroutine = path->parent->fdwroutine;
@@ -1106,13 +1117,15 @@ is_async_capable_path(Path *path)
Assert(fdwroutine != NULL);
if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
- return true;
+ break;
+ return false;
}
- break;
default:
- break;
+ return false;
}
- return false;
+
+ plan->async_capable = true;
+ return true;
}
/*
@@ -1278,9 +1291,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
subplans = lappend(subplans, subplan);
/* Check to see if subplan can be executed asynchronously */
- if (consider_async && is_async_capable_path(subpath))
+ if (consider_async && mark_async_capable_plan(subplan, subpath))
{
- subplan->async_capable = true;
+ Assert(subplan->async_capable);
++nasyncplans;
}
}
@@ -5551,6 +5564,7 @@ make_subqueryscan(List *qptlist,
plan->righttree = NULL;
node->scan.scanrelid = scanrelid;
node->subplan = subplan;
+ node->status = SUBQUERY_SCAN_UNKNOWN;
return node;
}
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index e50624c465..a4455b8a44 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -115,7 +115,6 @@ static Plan *set_indexonlyscan_references(PlannerInfo *root,
static Plan *set_subqueryscan_references(PlannerInfo *root,
SubqueryScan *plan,
int rtoffset);
-static bool trivial_subqueryscan(SubqueryScan *plan);
static Plan *clean_up_removed_plan_level(Plan *parent, Plan *child);
static void set_foreignscan_references(PlannerInfo *root,
ForeignScan *fscan,
@@ -1206,13 +1205,22 @@ set_subqueryscan_references(PlannerInfo *root,
* We can delete it if it has no qual to check and the targetlist just
* regurgitates the output of the child plan.
*/
-static bool
+bool
trivial_subqueryscan(SubqueryScan *plan)
{
int attrno;
ListCell *lp,
*lc;
+ /* We might have detected this already */
+ if (plan->status == SUBQUERY_SCAN_TRIVIAL)
+ return true;
+ if (plan->status == SUBQUERY_SCAN_NONTRIVIAL)
+ return false;
+ Assert(plan->status == SUBQUERY_SCAN_UNKNOWN);
+
+ plan->status = SUBQUERY_SCAN_NONTRIVIAL;
+
if (plan->scan.plan.qual != NIL)
return false;
@@ -1254,6 +1262,7 @@ trivial_subqueryscan(SubqueryScan *plan)
attrno++;
}
+ plan->status = SUBQUERY_SCAN_TRIVIAL;
return true;
}
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index ec9a8b0c81..67f21080df 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -518,16 +518,28 @@ typedef struct TidRangeScan
* relation, we make this a descendant of Scan anyway for code-sharing
* purposes.
*
+ * SubqueryScanStatus caches the trivial_subqueryscan property of the node.
+ * SUBQUERY_SCAN_UNKNOWN means not yet determined. This is only used during
+ * planning.
+ *
* Note: we store the sub-plan in the type-specific subplan field, not in
* the generic lefttree field as you might expect. This is because we do
* not want plan-tree-traversal routines to recurse into the subplan without
* knowing that they are changing Query contexts.
* ----------------
*/
+typedef enum SubqueryScanStatus
+{
+ SUBQUERY_SCAN_UNKNOWN,
+ SUBQUERY_SCAN_TRIVIAL,
+ SUBQUERY_SCAN_NONTRIVIAL
+} SubqueryScanStatus;
+
typedef struct SubqueryScan
{
Scan scan;
Plan *subplan;
+ SubqueryScanStatus status;
} SubqueryScan;
/* ----------------
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index bf1adfc52a..c908a49490 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -112,6 +112,7 @@ extern bool innerrel_is_unique(PlannerInfo *root,
* prototypes for plan/setrefs.c
*/
extern Plan *set_plan_references(PlannerInfo *root, Plan *plan);
+extern bool trivial_subqueryscan(SubqueryScan *plan);
extern void record_plan_function_dependency(PlannerInfo *root, Oid funcid);
extern void record_plan_type_dependency(PlannerInfo *root, Oid typid);
extern bool extract_query_dependencies_walker(Node *node, PlannerInfo *root);
On 8/23/21 2:18 PM, Etsuro Fujita wrote:
To just execute what was planned at execution time, I think we should
return to the patch in [1]. The patch was created for Horiguchi-san’s
async-execution patch, so I modified it to work with HEAD, and added a
simplified version of your test cases. Please find attached a patch.
[1] /messages/by-id/7fe10f95-ac6c-c81d-a9d3-227493eb9055@postgrespro.ru
I agree, this way is safer. I tried to find another approach, because
there is no general solution here: we would have to implement support
for asynchronous behaviour in each plan node.
But for practical use, with a small set of nodes, it will work well. I
have no objections to this patch.
--
regards,
Andrey Lepikhov
Postgres Professional
On Mon, Aug 30, 2021 at 5:36 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
On 8/23/21 2:18 PM, Etsuro Fujita wrote:
To just execute what was planned at execution time, I think we should
return to the patch in [1]. The patch was created for Horiguchi-san’s
async-execution patch, so I modified it to work with HEAD, and added a
simplified version of your test cases. Please find attached a patch.
[1] /messages/by-id/7fe10f95-ac6c-c81d-a9d3-227493eb9055@postgrespro.ru
I agree, this way is safer. I tried to find another approach, because
there is no general solution here: we would have to implement support
for asynchronous behaviour in each plan node.
I think so too.
But for practical use, with a small set of nodes, it will work well. I
have no objections to this patch.
OK
To allow async execution in a few more cases, I modified the patch a
bit further: a ProjectionPath placed directly above an async-capable
ForeignPath is also considered async-capable, since ForeignScan can
project and no separate Result node is needed in that case. I
modified mark_async_capable_plan() accordingly, and added test cases
to the postgres_fdw regression test. Attached is an updated version
of the patch.
Thanks for the review!
Best regards,
Etsuro Fujita
Attachments:
allow-async-in-more-cases-2.patchapplication/octet-stream; name=allow-async-in-more-cases-2.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index e3ee30f1aa..3e7f64a6ed 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10141,6 +10141,31 @@ SELECT * FROM result_tbl ORDER BY a;
2505 | 505 | 0505
(2 rows)
+DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT a, b + 1000, 'AAA' || c FROM async_pt WHERE b === 505;
+ QUERY PLAN
+------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, (async_pt_1.b + 1000), ('AAA'::text || async_pt_1.c)
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, (async_pt_2.b + 1000), ('AAA'::text || async_pt_2.c)
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+(10 rows)
+
+INSERT INTO result_tbl SELECT a, b + 1000, 'AAA' || c FROM async_pt WHERE b === 505;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+------+---------
+ 1505 | 1505 | AAA0505
+ 2505 | 1505 | AAA0505
+(2 rows)
+
DELETE FROM result_tbl;
-- Check case where multiple partitions use the same connection
CREATE TABLE base_tbl3 (a int, b int, c text);
@@ -10456,6 +10481,88 @@ DROP TABLE local_tbl;
DROP INDEX base_tbl1_idx;
DROP INDEX base_tbl2_idx;
DROP INDEX async_p3_idx;
+-- UNION queries
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT * FROM async_p2 WHERE b < 10);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> HashAggregate
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Group Key: async_p1.a, async_p1.b, async_p1.c
+ -> Append
+ -> Async Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 ORDER BY a ASC NULLS LAST LIMIT 10::bigint
+ -> Async Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, async_p2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((b < 10))
+(11 rows)
+
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT * FROM async_p2 WHERE b < 10);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+----+------
+ 1000 | 0 | 0000
+ 1005 | 5 | 0005
+ 1010 | 10 | 0010
+ 1015 | 15 | 0015
+ 1020 | 20 | 0020
+ 1025 | 25 | 0025
+ 1030 | 30 | 0030
+ 1035 | 35 | 0035
+ 1040 | 40 | 0040
+ 1045 | 45 | 0045
+ 2000 | 0 | 0000
+ 2005 | 5 | 0005
+(12 rows)
+
+DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT * FROM async_p2 WHERE b < 10);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 ORDER BY a ASC NULLS LAST LIMIT 10::bigint
+ -> Async Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, async_p2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((b < 10))
+(8 rows)
+
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT * FROM async_p2 WHERE b < 10);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+----+------
+ 1000 | 0 | 0000
+ 1005 | 5 | 0005
+ 1010 | 10 | 0010
+ 1015 | 15 | 0015
+ 1020 | 20 | 0020
+ 1025 | 25 | 0025
+ 1030 | 30 | 0030
+ 1035 | 35 | 0035
+ 1040 | 40 | 0040
+ 1045 | 45 | 0045
+ 2000 | 0 | 0000
+ 2005 | 5 | 0005
+(12 rows)
+
+DELETE FROM result_tbl;
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 30b5175da5..e881a8fd63 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3210,6 +3210,13 @@ INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
SELECT * FROM result_tbl ORDER BY a;
DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT a, b + 1000, 'AAA' || c FROM async_pt WHERE b === 505;
+INSERT INTO result_tbl SELECT a, b + 1000, 'AAA' || c FROM async_pt WHERE b === 505;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
-- Check case where multiple partitions use the same connection
CREATE TABLE base_tbl3 (a int, b int, c text);
CREATE FOREIGN TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000)
@@ -3322,6 +3329,33 @@ DROP INDEX base_tbl1_idx;
DROP INDEX base_tbl2_idx;
DROP INDEX async_p3_idx;
+-- UNION queries
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT * FROM async_p2 WHERE b < 10);
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT * FROM async_p2 WHERE b < 10);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT * FROM async_p2 WHERE b < 10);
+INSERT INTO result_tbl
+(SELECT * FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT * FROM async_p2 WHERE b < 10);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 38251c2b8e..5ef1888407 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -626,6 +626,7 @@ _copySubqueryScan(const SubqueryScan *from)
* copy remainder of node
*/
COPY_NODE_FIELD(subplan);
+ COPY_SCALAR_FIELD(status);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 87561cbb6f..0d41a9ab4f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -629,6 +629,7 @@ _outSubqueryScan(StringInfo str, const SubqueryScan *node)
_outScanInfo(str, (const Scan *) node);
WRITE_NODE_FIELD(subplan);
+ WRITE_ENUM_FIELD(status, SubqueryScanStatus);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 0dd1ad7dfc..ebee3b6acf 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1967,6 +1967,7 @@ _readSubqueryScan(void)
ReadCommonScan(&local_node->scan);
READ_NODE_FIELD(subplan);
+ READ_ENUM_FIELD(status, SubqueryScanStatus);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index a5f6d678cc..a420fe09f2 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -82,7 +82,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
-static bool is_async_capable_path(Path *path);
+static bool mark_async_capable_plan(Plan *plan, Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1091,14 +1091,25 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
}
/*
- * is_async_capable_path
- * Check whether a given Path node is async-capable.
+ * mark_async_capable_plan
+ * Check whether a given Path node is async-capable, and if so, mark the
+ * Plan node created from it as such.
*/
static bool
-is_async_capable_path(Path *path)
+mark_async_capable_plan(Plan *plan, Path *path)
{
switch (nodeTag(path))
{
+ case T_SubqueryScanPath:
+ {
+ SubqueryScan *splan = (SubqueryScan *) plan;
+
+ if (trivial_subqueryscan(splan) &&
+ mark_async_capable_plan(splan->subplan,
+ ((SubqueryScanPath *) path)->subpath))
+ break;
+ return false;
+ }
case T_ForeignPath:
{
FdwRoutine *fdwroutine = path->parent->fdwroutine;
@@ -1106,13 +1117,21 @@ is_async_capable_path(Path *path)
Assert(fdwroutine != NULL);
if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
- return true;
+ break;
+ return false;
}
- break;
+ case T_ProjectionPath:
+ if (!IsA(plan, Result) &&
+ mark_async_capable_plan(plan,
+ ((ProjectionPath *) path)->subpath))
+ return true;
+ return false;
default:
- break;
+ return false;
}
- return false;
+
+ plan->async_capable = true;
+ return true;
}
/*
@@ -1278,9 +1297,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
subplans = lappend(subplans, subplan);
/* Check to see if subplan can be executed asynchronously */
- if (consider_async && is_async_capable_path(subpath))
+ if (consider_async && mark_async_capable_plan(subplan, subpath))
{
- subplan->async_capable = true;
+ Assert(subplan->async_capable);
++nasyncplans;
}
}
@@ -5551,6 +5570,7 @@ make_subqueryscan(List *qptlist,
plan->righttree = NULL;
node->scan.scanrelid = scanrelid;
node->subplan = subplan;
+ node->status = SUBQUERY_SCAN_UNKNOWN;
return node;
}
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index e50624c465..a4455b8a44 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -115,7 +115,6 @@ static Plan *set_indexonlyscan_references(PlannerInfo *root,
static Plan *set_subqueryscan_references(PlannerInfo *root,
SubqueryScan *plan,
int rtoffset);
-static bool trivial_subqueryscan(SubqueryScan *plan);
static Plan *clean_up_removed_plan_level(Plan *parent, Plan *child);
static void set_foreignscan_references(PlannerInfo *root,
ForeignScan *fscan,
@@ -1206,13 +1205,22 @@ set_subqueryscan_references(PlannerInfo *root,
* We can delete it if it has no qual to check and the targetlist just
* regurgitates the output of the child plan.
*/
-static bool
+bool
trivial_subqueryscan(SubqueryScan *plan)
{
int attrno;
ListCell *lp,
*lc;
+ /* We might have detected this already */
+ if (plan->status == SUBQUERY_SCAN_TRIVIAL)
+ return true;
+ if (plan->status == SUBQUERY_SCAN_NONTRIVIAL)
+ return false;
+ Assert(plan->status == SUBQUERY_SCAN_UNKNOWN);
+
+ plan->status = SUBQUERY_SCAN_NONTRIVIAL;
+
if (plan->scan.plan.qual != NIL)
return false;
@@ -1254,6 +1262,7 @@ trivial_subqueryscan(SubqueryScan *plan)
attrno++;
}
+ plan->status = SUBQUERY_SCAN_TRIVIAL;
return true;
}
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index ec9a8b0c81..67f21080df 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -518,16 +518,28 @@ typedef struct TidRangeScan
* relation, we make this a descendant of Scan anyway for code-sharing
* purposes.
*
+ * SubqueryScanStatus caches the trivial_subqueryscan property of the node.
+ * SUBQUERY_SCAN_UNKNOWN means not yet determined. This is only used during
+ * planning.
+ *
* Note: we store the sub-plan in the type-specific subplan field, not in
* the generic lefttree field as you might expect. This is because we do
* not want plan-tree-traversal routines to recurse into the subplan without
* knowing that they are changing Query contexts.
* ----------------
*/
+typedef enum SubqueryScanStatus
+{
+ SUBQUERY_SCAN_UNKNOWN,
+ SUBQUERY_SCAN_TRIVIAL,
+ SUBQUERY_SCAN_NONTRIVIAL
+} SubqueryScanStatus;
+
typedef struct SubqueryScan
{
Scan scan;
Plan *subplan;
+ SubqueryScanStatus status;
} SubqueryScan;
/* ----------------
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index bf1adfc52a..c908a49490 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -112,6 +112,7 @@ extern bool innerrel_is_unique(PlannerInfo *root,
* prototypes for plan/setrefs.c
*/
extern Plan *set_plan_references(PlannerInfo *root, Plan *plan);
+extern bool trivial_subqueryscan(SubqueryScan *plan);
extern void record_plan_function_dependency(PlannerInfo *root, Oid funcid);
extern void record_plan_type_dependency(PlannerInfo *root, Oid typid);
extern bool extract_query_dependencies_walker(Node *node, PlannerInfo *root);
Etsuro Fujita wrote on 2021-08-30 12:52:
On Mon, Aug 30, 2021 at 5:36 PM Andrey V. Lepikhov
To allow async execution in a bit more cases, I modified the patch a
bit further: a ProjectionPath put directly above an async-capable
ForeignPath would also be considered async-capable as ForeignScan can
project and no separate Result is needed in that case, so I modified
mark_async_capable_plan() as such, and added test cases to the
postgres_fdw regression test. Attached is an updated version of the
patch.
Hi.
The patch looks good to me and seems to work as expected.
--
Best regards,
Alexander Pyhalov,
Postgres Professional
Hi Alexander,
On Wed, Sep 15, 2021 at 3:40 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:
Etsuro Fujita wrote on 2021-08-30 12:52:
To allow async execution in a bit more cases, I modified the patch a
bit further: a ProjectionPath put directly above an async-capable
ForeignPath would also be considered async-capable as ForeignScan can
project and no separate Result is needed in that case, so I modified
mark_async_capable_plan() as such, and added test cases to the
postgres_fdw regression test. Attached is an updated version of the
patch.
The patch looks good to me and seems to work as expected.
Thanks for reviewing! I’m planning to commit the patch.
Sorry for the long delay.
Best regards,
Etsuro Fujita
On Sun, Mar 13, 2022 at 6:39 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Wed, Sep 15, 2021 at 3:40 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:
The patch looks good to me and seems to work as expected.
I’m planning to commit the patch.
I polished the patch a bit:
* Reordered a bit of code in create_append_plan() in logical order (no
functional changes).
* Added more comments.
* Added/Tweaked regression test cases.
Also, I added the commit message. Attached is a new version of the
patch. Barring objections, I’ll commit this.
Best regards,
Etsuro Fujita
Attachments:
allow-async-in-more-cases-3.patch (application/octet-stream)
From 9935ea071408d46c140916dbcca7ff069f9e7d38 Mon Sep 17 00:00:00 2001
From: Etsuro Fujita <etsuro.fujita@gmail.com>
Date: Sun, 3 Apr 2022 18:55:34 +0900
Subject: [PATCH] Allow asynchronous execution in more cases.
In commit 27e1f1456, create_append_plan() only allowed the subplan
created from a given subpath to be executed asynchronously when it was
an async-capable ForeignPath. To extend coverage, this patch handles
cases when the given subpath includes some other Path types as well that
can be omitted in the plan processing, such as a ProjectionPath directly
atop an async-capable ForeignPath, allowing asynchronous execution in
partitioned-scan/partitioned-join queries with non-Var tlist expressions
and more UNION queries.
Andrey Lepikhov and Etsuro Fujita, reviewed by Alexander Pyhalov.
Discussion: https://postgr.es/m/659c37a8-3e71-0ff2-394c-f04428c76f08%40postgrespro.ru
---
contrib/postgres_fdw/expected/postgres_fdw.out | 170 +++++++++++++++++++++++++
contrib/postgres_fdw/sql/postgres_fdw.sql | 41 ++++++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 56 ++++++--
src/backend/optimizer/plan/setrefs.c | 18 ++-
src/include/nodes/plannodes.h | 12 ++
src/include/optimizer/planmain.h | 1 +
9 files changed, 286 insertions(+), 15 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 11e9b4e8cc..30e95f585f 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10221,6 +10221,31 @@ SELECT * FROM result_tbl ORDER BY a;
2505 | 505 | 0505
(2 rows)
+DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT a, b, 'AAA' || c FROM async_pt WHERE b === 505;
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, ('AAA'::text || async_pt_1.c)
+ Filter: (async_pt_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Async Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, ('AAA'::text || async_pt_2.c)
+ Filter: (async_pt_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+(10 rows)
+
+INSERT INTO result_tbl SELECT a, b, 'AAA' || c FROM async_pt WHERE b === 505;
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+-----+---------
+ 1505 | 505 | AAA0505
+ 2505 | 505 | AAA0505
+(2 rows)
+
DELETE FROM result_tbl;
-- Check case where multiple partitions use the same connection
CREATE TABLE base_tbl3 (a int, b int, c text);
@@ -10358,6 +10383,69 @@ SELECT * FROM join_tbl ORDER BY a1;
3900 | 900 | 0900 | 3900 | 900 | 0900
(30 rows)
+DELETE FROM join_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO join_tbl SELECT t1.a, t1.b, 'AAA' || t1.c, t2.a, t2.b, 'AAA' || t2.c FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Insert on public.join_tbl
+ -> Append
+ -> Async Foreign Scan
+ Output: t1_1.a, t1_1.b, ('AAA'::text || t1_1.c), t2_1.a, t2_1.b, ('AAA'::text || t2_1.c)
+ Relations: (public.async_p1 t1_1) INNER JOIN (public.async_p1 t2_1)
+ Remote SQL: SELECT r5.a, r5.b, r5.c, r8.a, r8.b, r8.c FROM (public.base_tbl1 r5 INNER JOIN public.base_tbl1 r8 ON (((r5.a = r8.a)) AND ((r5.b = r8.b)) AND (((r5.b % 100) = 0))))
+ -> Async Foreign Scan
+ Output: t1_2.a, t1_2.b, ('AAA'::text || t1_2.c), t2_2.a, t2_2.b, ('AAA'::text || t2_2.c)
+ Relations: (public.async_p2 t1_2) INNER JOIN (public.async_p2 t2_2)
+ Remote SQL: SELECT r6.a, r6.b, r6.c, r9.a, r9.b, r9.c FROM (public.base_tbl2 r6 INNER JOIN public.base_tbl2 r9 ON (((r6.a = r9.a)) AND ((r6.b = r9.b)) AND (((r6.b % 100) = 0))))
+ -> Hash Join
+ Output: t1_3.a, t1_3.b, ('AAA'::text || t1_3.c), t2_3.a, t2_3.b, ('AAA'::text || t2_3.c)
+ Hash Cond: ((t2_3.a = t1_3.a) AND (t2_3.b = t1_3.b))
+ -> Seq Scan on public.async_p3 t2_3
+ Output: t2_3.a, t2_3.b, t2_3.c
+ -> Hash
+ Output: t1_3.a, t1_3.b, t1_3.c
+ -> Seq Scan on public.async_p3 t1_3
+ Output: t1_3.a, t1_3.b, t1_3.c
+ Filter: ((t1_3.b % 100) = 0)
+(20 rows)
+
+INSERT INTO join_tbl SELECT t1.a, t1.b, 'AAA' || t1.c, t2.a, t2.b, 'AAA' || t2.c FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+SELECT * FROM join_tbl ORDER BY a1;
+ a1 | b1 | c1 | a2 | b2 | c2
+------+-----+---------+------+-----+---------
+ 1000 | 0 | AAA0000 | 1000 | 0 | AAA0000
+ 1100 | 100 | AAA0100 | 1100 | 100 | AAA0100
+ 1200 | 200 | AAA0200 | 1200 | 200 | AAA0200
+ 1300 | 300 | AAA0300 | 1300 | 300 | AAA0300
+ 1400 | 400 | AAA0400 | 1400 | 400 | AAA0400
+ 1500 | 500 | AAA0500 | 1500 | 500 | AAA0500
+ 1600 | 600 | AAA0600 | 1600 | 600 | AAA0600
+ 1700 | 700 | AAA0700 | 1700 | 700 | AAA0700
+ 1800 | 800 | AAA0800 | 1800 | 800 | AAA0800
+ 1900 | 900 | AAA0900 | 1900 | 900 | AAA0900
+ 2000 | 0 | AAA0000 | 2000 | 0 | AAA0000
+ 2100 | 100 | AAA0100 | 2100 | 100 | AAA0100
+ 2200 | 200 | AAA0200 | 2200 | 200 | AAA0200
+ 2300 | 300 | AAA0300 | 2300 | 300 | AAA0300
+ 2400 | 400 | AAA0400 | 2400 | 400 | AAA0400
+ 2500 | 500 | AAA0500 | 2500 | 500 | AAA0500
+ 2600 | 600 | AAA0600 | 2600 | 600 | AAA0600
+ 2700 | 700 | AAA0700 | 2700 | 700 | AAA0700
+ 2800 | 800 | AAA0800 | 2800 | 800 | AAA0800
+ 2900 | 900 | AAA0900 | 2900 | 900 | AAA0900
+ 3000 | 0 | AAA0000 | 3000 | 0 | AAA0000
+ 3100 | 100 | AAA0100 | 3100 | 100 | AAA0100
+ 3200 | 200 | AAA0200 | 3200 | 200 | AAA0200
+ 3300 | 300 | AAA0300 | 3300 | 300 | AAA0300
+ 3400 | 400 | AAA0400 | 3400 | 400 | AAA0400
+ 3500 | 500 | AAA0500 | 3500 | 500 | AAA0500
+ 3600 | 600 | AAA0600 | 3600 | 600 | AAA0600
+ 3700 | 700 | AAA0700 | 3700 | 700 | AAA0700
+ 3800 | 800 | AAA0800 | 3800 | 800 | AAA0800
+ 3900 | 900 | AAA0900 | 3900 | 900 | AAA0900
+(30 rows)
+
DELETE FROM join_tbl;
RESET enable_partitionwise_join;
-- Test rescan of an async Append node with do_exec_prune=false
@@ -10536,6 +10624,88 @@ DROP TABLE local_tbl;
DROP INDEX base_tbl1_idx;
DROP INDEX base_tbl2_idx;
DROP INDEX async_p3_idx;
+-- UNION queries
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT a, b, 'AAA' || c FROM async_p2 WHERE b < 10);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> HashAggregate
+ Output: async_p1.a, async_p1.b, (('AAA'::text || async_p1.c))
+ Group Key: async_p1.a, async_p1.b, (('AAA'::text || async_p1.c))
+ -> Append
+ -> Async Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, ('AAA'::text || async_p1.c)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 ORDER BY a ASC NULLS LAST LIMIT 10::bigint
+ -> Async Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, ('AAA'::text || async_p2.c)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((b < 10))
+(11 rows)
+
+INSERT INTO result_tbl
+(SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT a, b, 'AAA' || c FROM async_p2 WHERE b < 10);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+----+---------
+ 1000 | 0 | AAA0000
+ 1005 | 5 | AAA0005
+ 1010 | 10 | AAA0010
+ 1015 | 15 | AAA0015
+ 1020 | 20 | AAA0020
+ 1025 | 25 | AAA0025
+ 1030 | 30 | AAA0030
+ 1035 | 35 | AAA0035
+ 1040 | 40 | AAA0040
+ 1045 | 45 | AAA0045
+ 2000 | 0 | AAA0000
+ 2005 | 5 | AAA0005
+(12 rows)
+
+DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT a, b, 'AAA' || c FROM async_p2 WHERE b < 10);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------
+ Insert on public.result_tbl
+ -> Append
+ -> Async Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, ('AAA'::text || async_p1.c)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 ORDER BY a ASC NULLS LAST LIMIT 10::bigint
+ -> Async Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, ('AAA'::text || async_p2.c)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((b < 10))
+(8 rows)
+
+INSERT INTO result_tbl
+(SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT a, b, 'AAA' || c FROM async_p2 WHERE b < 10);
+SELECT * FROM result_tbl ORDER BY a;
+ a | b | c
+------+----+---------
+ 1000 | 0 | AAA0000
+ 1005 | 5 | AAA0005
+ 1010 | 10 | AAA0010
+ 1015 | 15 | AAA0015
+ 1020 | 20 | AAA0020
+ 1025 | 25 | AAA0025
+ 1030 | 30 | AAA0030
+ 1035 | 35 | AAA0035
+ 1040 | 40 | AAA0040
+ 1045 | 45 | AAA0045
+ 2000 | 0 | AAA0000
+ 2005 | 5 | AAA0005
+(12 rows)
+
+DELETE FROM result_tbl;
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 6b5de89e14..ea35e61eb8 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3245,6 +3245,13 @@ INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
SELECT * FROM result_tbl ORDER BY a;
DELETE FROM result_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl SELECT a, b, 'AAA' || c FROM async_pt WHERE b === 505;
+INSERT INTO result_tbl SELECT a, b, 'AAA' || c FROM async_pt WHERE b === 505;
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
-- Check case where multiple partitions use the same connection
CREATE TABLE base_tbl3 (a int, b int, c text);
CREATE FOREIGN TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000)
@@ -3286,6 +3293,13 @@ INSERT INTO join_tbl SELECT * FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AN
SELECT * FROM join_tbl ORDER BY a1;
DELETE FROM join_tbl;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO join_tbl SELECT t1.a, t1.b, 'AAA' || t1.c, t2.a, t2.b, 'AAA' || t2.c FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+INSERT INTO join_tbl SELECT t1.a, t1.b, 'AAA' || t1.c, t2.a, t2.b, 'AAA' || t2.c FROM async_pt t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
+
+SELECT * FROM join_tbl ORDER BY a1;
+DELETE FROM join_tbl;
+
RESET enable_partitionwise_join;
-- Test rescan of an async Append node with do_exec_prune=false
@@ -3357,6 +3371,33 @@ DROP INDEX base_tbl1_idx;
DROP INDEX base_tbl2_idx;
DROP INDEX async_p3_idx;
+-- UNION queries
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT a, b, 'AAA' || c FROM async_p2 WHERE b < 10);
+INSERT INTO result_tbl
+(SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10)
+UNION
+(SELECT a, b, 'AAA' || c FROM async_p2 WHERE b < 10);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO result_tbl
+(SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT a, b, 'AAA' || c FROM async_p2 WHERE b < 10);
+INSERT INTO result_tbl
+(SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10)
+UNION ALL
+(SELECT a, b, 'AAA' || c FROM async_p2 WHERE b < 10);
+
+SELECT * FROM result_tbl ORDER BY a;
+DELETE FROM result_tbl;
+
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 11c016495e..e2f464f8cb 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -632,6 +632,7 @@ _copySubqueryScan(const SubqueryScan *from)
* copy remainder of node
*/
COPY_NODE_FIELD(subplan);
+ COPY_SCALAR_FIELD(status);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 6e39590730..b572d0727b 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -638,6 +638,7 @@ _outSubqueryScan(StringInfo str, const SubqueryScan *node)
_outScanInfo(str, (const Scan *) node);
WRITE_NODE_FIELD(subplan);
+ WRITE_ENUM_FIELD(status, SubqueryScanStatus);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index c94b2561f0..af7a49f655 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2164,6 +2164,7 @@ _readSubqueryScan(void)
ReadCommonScan(&local_node->scan);
READ_NODE_FIELD(subplan);
+ READ_ENUM_FIELD(status, SubqueryScanStatus);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 179c87c671..2fd297e716 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -82,7 +82,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
-static bool is_async_capable_path(Path *path);
+static bool mark_async_capable_plan(Plan *plan, Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1110,14 +1110,29 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
}
/*
- * is_async_capable_path
- * Check whether a given Path node is async-capable.
+ * mark_async_capable_plan
+ * Check whether a given Path node is async-capable, and if so, mark the
+ * Plan node created from it as such.
*/
static bool
-is_async_capable_path(Path *path)
+mark_async_capable_plan(Plan *plan, Path *path)
{
switch (nodeTag(path))
{
+ case T_SubqueryScanPath:
+ {
+ SubqueryScan *scan_plan = (SubqueryScan *) plan;
+
+ /*
+ * If a SubqueryScan node atop of an async-capable plan node
+ * is deletable, consider it as async-capable.
+ */
+ if (trivial_subqueryscan(scan_plan) &&
+ mark_async_capable_plan(scan_plan->subplan,
+ ((SubqueryScanPath *) path)->subpath))
+ break;
+ return false;
+ }
case T_ForeignPath:
{
FdwRoutine *fdwroutine = path->parent->fdwroutine;
@@ -1125,13 +1140,27 @@ is_async_capable_path(Path *path)
Assert(fdwroutine != NULL);
if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
- return true;
+ break;
+ return false;
}
- break;
+ case T_ProjectionPath:
+
+ /*
+ * If the generated plan node doesn't include a Result node,
+ * consider it as async-capable if the subpath is async-capable.
+ */
+ if (!IsA(plan, Result) &&
+ mark_async_capable_plan(plan,
+ ((ProjectionPath *) path)->subpath))
+ return true;
+ return false;
default:
- break;
+ return false;
}
- return false;
+
+ plan->async_capable = true;
+
+ return true;
}
/*
@@ -1294,14 +1323,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
}
- subplans = lappend(subplans, subplan);
-
- /* Check to see if subplan can be executed asynchronously */
- if (consider_async && is_async_capable_path(subpath))
+ /* If needed, check to see if subplan can be executed asynchronously */
+ if (consider_async && mark_async_capable_plan(subplan, subpath))
{
- subplan->async_capable = true;
+ Assert(subplan->async_capable);
++nasyncplans;
}
+
+ subplans = lappend(subplans, subplan);
}
/*
@@ -5598,6 +5627,7 @@ make_subqueryscan(List *qptlist,
plan->righttree = NULL;
node->scan.scanrelid = scanrelid;
node->subplan = subplan;
+ node->status = SUBQUERY_SCAN_UNKNOWN;
return node;
}
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index bf4c722c02..df55ae96ba 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -115,7 +115,6 @@ static Plan *set_indexonlyscan_references(PlannerInfo *root,
static Plan *set_subqueryscan_references(PlannerInfo *root,
SubqueryScan *plan,
int rtoffset);
-static bool trivial_subqueryscan(SubqueryScan *plan);
static Plan *clean_up_removed_plan_level(Plan *parent, Plan *child);
static void set_foreignscan_references(PlannerInfo *root,
ForeignScan *fscan,
@@ -1319,14 +1318,26 @@ set_subqueryscan_references(PlannerInfo *root,
*
* We can delete it if it has no qual to check and the targetlist just
* regurgitates the output of the child plan.
+ *
+ * This might be called repeatedly on a SubqueryScan node, so we cache the
+ * result in the SubqueryScan node to avoid repeated computation.
*/
-static bool
+bool
trivial_subqueryscan(SubqueryScan *plan)
{
int attrno;
ListCell *lp,
*lc;
+ /* We might have detected this already (see mark_async_capable_plan) */
+ if (plan->status == SUBQUERY_SCAN_TRIVIAL)
+ return true;
+ if (plan->status == SUBQUERY_SCAN_NONTRIVIAL)
+ return false;
+ Assert(plan->status == SUBQUERY_SCAN_UNKNOWN);
+ /* Initially, mark the SubqueryScan as non-deletable from the plan tree */
+ plan->status = SUBQUERY_SCAN_NONTRIVIAL;
+
if (plan->scan.plan.qual != NIL)
return false;
@@ -1368,6 +1379,9 @@ trivial_subqueryscan(SubqueryScan *plan)
attrno++;
}
+ /* Re-mark the SubqueryScan as deletable from the plan tree */
+ plan->status = SUBQUERY_SCAN_TRIVIAL;
+
return true;
}
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 50ef3dda05..430c95bca9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -536,16 +536,28 @@ typedef struct TidRangeScan
* relation, we make this a descendant of Scan anyway for code-sharing
* purposes.
*
+ * SubqueryScanStatus caches the trivial_subqueryscan property of the node.
+ * SUBQUERY_SCAN_UNKNOWN means not yet determined. This is only used during
+ * planning.
+ *
* Note: we store the sub-plan in the type-specific subplan field, not in
* the generic lefttree field as you might expect. This is because we do
* not want plan-tree-traversal routines to recurse into the subplan without
* knowing that they are changing Query contexts.
* ----------------
*/
+typedef enum SubqueryScanStatus
+{
+ SUBQUERY_SCAN_UNKNOWN,
+ SUBQUERY_SCAN_TRIVIAL,
+ SUBQUERY_SCAN_NONTRIVIAL
+} SubqueryScanStatus;
+
typedef struct SubqueryScan
{
Scan scan;
Plan *subplan;
+ SubqueryScanStatus status;
} SubqueryScan;
/* ----------------
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 54a0d4c188..6947bc65d1 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -112,6 +112,7 @@ extern bool innerrel_is_unique(PlannerInfo *root,
* prototypes for plan/setrefs.c
*/
extern Plan *set_plan_references(PlannerInfo *root, Plan *plan);
+extern bool trivial_subqueryscan(SubqueryScan *plan);
extern void record_plan_function_dependency(PlannerInfo *root, Oid funcid);
extern void record_plan_type_dependency(PlannerInfo *root, Oid typid);
extern bool extract_query_dependencies_walker(Node *node, PlannerInfo *root);
--
2.14.3 (Apple Git-98)
On Sun, Apr 3, 2022 at 3:28 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:
On Sun, Mar 13, 2022 at 6:39 PM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:
On Wed, Sep 15, 2021 at 3:40 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:
The patch looks good to me and seems to work as expected.
I’m planning to commit the patch.
I polished the patch a bit:
* Reordered a bit of code in create_append_plan() in logical order (no
functional changes).
* Added more comments.
* Added/Tweaked regression test cases.
Also, I added the commit message. Attached is a new version of the
patch. Barring objections, I’ll commit this.
Best regards,
Etsuro Fujita
Hi,
+ WRITE_ENUM_FIELD(status, SubqueryScanStatus);
Looks like the new field can be named subquerystatus - this way its purpose
is clearer.
+ * mark_async_capable_plan
+ * Check whether a given Path node is async-capable, and if so, mark
the
+ * Plan node created from it as such.
Please add comment explaining what the return value means.
+ if (!IsA(plan, Result) &&
+ mark_async_capable_plan(plan,
+ ((ProjectionPath *) path)->subpath))
+ return true;
by returning true, `plan->async_capable = true;` is skipped.
Is that intentional ?
Cheers
Hi Zhihong,
On Sun, Apr 3, 2022 at 11:38 PM Zhihong Yu <zyu@yugabyte.com> wrote:
+ WRITE_ENUM_FIELD(status, SubqueryScanStatus);
Looks like the new field can be named subquerystatus - this way its purpose is clearer.
I agree that “status” is too general. “subquerystatus” might be good,
but I’d like to propose “scanstatus” instead, because I think this
would be consistent with the naming of the RowMarkType-enum member
“markType” in PlanRowMark defined in the same file.
+ * mark_async_capable_plan
+ * Check whether a given Path node is async-capable, and if so, mark the
+ * Plan node created from it as such.
Please add comment explaining what the return value means.
Ok, how about something like this?
“Check whether a given Path node is async-capable, and if so, mark the
Plan node created from it as such and return true; otherwise, return
false.”
+ if (!IsA(plan, Result) &&
+     mark_async_capable_plan(plan,
+                             ((ProjectionPath *) path)->subpath))
+     return true;
by returning true, `plan->async_capable = true;` is skipped.
Is that intentional ?
That is intentional; we don’t need to set the async_capable flag
because in that case the flag would already have been set by the above
mark_async_capable_plan(). Note that we pass “plan” to that function.
Thanks for reviewing!
Best regards,
Etsuro Fujita
On 4/3/22 15:29, Etsuro Fujita wrote:
On Sun, Mar 13, 2022 at 6:39 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Wed, Sep 15, 2021 at 3:40 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:
The patch looks good to me and seems to work as expected.
I’m planning to commit the patch.
I polished the patch a bit:
* Reordered a bit of code in create_append_plan() in logical order (no
functional changes).
* Added more comments.
* Added/Tweaked regression test cases.
Also, I added the commit message. Attached is a new version of the
patch. Barring objections, I’ll commit this.
Sorry for the late answer - I was on vacation.
I looked through this patch - it looks much more stable.
But, as far as I remember, some problems were found with the previous
version on the TPC-H test. I want to play a bit with TPC-H and with
parameterized plans.
--
regards,
Andrey Lepikhov
Postgres Professional
On Mon, Apr 4, 2022 at 1:06 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Sun, Apr 3, 2022 at 11:38 PM Zhihong Yu <zyu@yugabyte.com> wrote:
+ WRITE_ENUM_FIELD(status, SubqueryScanStatus);
Looks like the new field can be named subquerystatus - this way its purpose is clearer.
I agree that “status” is too general. “subquerystatus” might be good,
but I’d like to propose “scanstatus” instead, because I think this
would be consistent with the naming of the RowMarkType-enum member
“markType” in PlanRowMark defined in the same file.
+ * mark_async_capable_plan
+ * Check whether a given Path node is async-capable, and if so, mark the
+ * Plan node created from it as such.
Please add comment explaining what the return value means.
Ok, how about something like this?
“Check whether a given Path node is async-capable, and if so, mark the
Plan node created from it as such and return true; otherwise, return
false.”
I have committed the patch after modifying it as such. (I think we
can improve these later, if necessary.)
Best regards,
Etsuro Fujita
On Mon, Apr 4, 2022 at 6:30 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
On 4/3/22 15:29, Etsuro Fujita wrote:
Also, I added the commit message. Attached is a new version of the
patch. Barring objections, I’ll commit this.
I looked through this patch - it looks much more stable.
But, as far as I remember, some problems were found with the previous
version on the TPC-H test. I want to play a bit with TPC-H and with
parameterized plans.
I might be missing something, but I don't see any problems, so I have
committed the patch after some modifications. If you find them,
please let me know.
Thanks!
Best regards,
Etsuro Fujita
On Wed, Apr 06, 2022 at 03:58:29PM +0900, Etsuro Fujita wrote:
I have committed the patch after modifying it as such. (I think we
can improve these later, if necessary.)
This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.
| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at ../../../../src/include/nodes/pg_list.h:151
151 return l ? l->length : 0;
(gdb) bt
#0 0x000055b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at ../../../../src/include/nodes/pg_list.h:151
#1 0x000055b43968af89 in mark_async_capable_plan (plan=plan@entry=0x7f4219ed93b0, path=path@entry=0x7f4219e89538) at createplan.c:1132
#2 0x000055b439691924 in create_append_plan (root=root@entry=0x55b43affb2b0, best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at createplan.c:1329
#3 0x000055b43968fa21 in create_plan_recurse (root=root@entry=0x55b43affb2b0, best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at createplan.c:421
#4 0x000055b43968f974 in create_projection_plan (root=root@entry=0x55b43affb2b0, best_path=best_path@entry=0x7f4219ed0f60, flags=flags@entry=1) at createplan.c:2039
#5 0x000055b43968fa6f in create_plan_recurse (root=root@entry=0x55b43affb2b0, best_path=0x7f4219ed0f60, flags=flags@entry=1) at createplan.c:433
#6 0x000055b439690221 in create_plan (root=root@entry=0x55b43affb2b0, best_path=<optimized out>) at createplan.c:348
#7 0x000055b4396a1451 in standard_planner (parse=0x55b43af05e28, query_string=<optimized out>, cursorOptions=2048, boundParams=0x0) at planner.c:413
#8 0x000055b4396a19c1 in planner (parse=parse@entry=0x55b43af05e28, query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) at planner.c:277
#9 0x000055b439790c78 in pg_plan_query (querytree=querytree@entry=0x55b43af05e28, query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) at postgres.c:883
#10 0x000055b439790d54 in pg_plan_queries (querytrees=0x55b43afdd528, query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) at postgres.c:975
#11 0x000055b439791239 in exec_simple_query (query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();") at postgres.c:1169
#12 0x000055b439793183 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4542
#13 0x000055b4396e6af7 in BackendRun (port=port@entry=0x55b43af2ffe0) at postmaster.c:4489
#14 0x000055b4396e9c03 in BackendStartup (port=port@entry=0x55b43af2ffe0) at postmaster.c:4217
#15 0x000055b4396e9e4a in ServerLoop () at postmaster.c:1791
#16 0x000055b4396eb401 in PostmasterMain (argc=7, argv=<optimized out>) at postmaster.c:1463
#17 0x000055b43962b4df in main (argc=7, argv=0x55b43aeff0c0) at main.c:202
Actually, the original query failed like this:
#2 0x000055b4398e9f90 in ExceptionalCondition (conditionName=conditionName@entry=0x55b439a61238 "plan->scanstatus == SUBQUERY_SCAN_UNKNOWN", errorType=errorType@entry=0x55b43994b00b "FailedAssertion",
#3 0x000055b4396a2ecf in trivial_subqueryscan (plan=0x55b43b59cac8) at setrefs.c:1367
On Fri, Apr 8, 2022 at 5:43 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Apr 06, 2022 at 03:58:29PM +0900, Etsuro Fujita wrote:
I have committed the patch after modifying it as such. (I think we
can improve these later, if necessary.)
This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.
| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at
../../../../src/include/nodes/pg_list.h:151
151 return l ? l->length : 0;
(gdb) bt
#0 0x000055b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at
../../../../src/include/nodes/pg_list.h:151
#1 0x000055b43968af89 in mark_async_capable_plan (plan=plan@entry=0x7f4219ed93b0,
path=path@entry=0x7f4219e89538) at createplan.c:1132
#2 0x000055b439691924 in create_append_plan (root=root@entry=0x55b43affb2b0,
best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at
createplan.c:1329
#3 0x000055b43968fa21 in create_plan_recurse (root=root@entry=0x55b43affb2b0,
best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at
createplan.c:421
#4 0x000055b43968f974 in create_projection_plan (root=root@entry=0x55b43affb2b0,
best_path=best_path@entry=0x7f4219ed0f60, flags=flags@entry=1) at
createplan.c:2039
#5 0x000055b43968fa6f in create_plan_recurse (root=root@entry=0x55b43affb2b0,
best_path=0x7f4219ed0f60, flags=flags@entry=1) at createplan.c:433
#6 0x000055b439690221 in create_plan (root=root@entry=0x55b43affb2b0,
best_path=<optimized out>) at createplan.c:348
#7 0x000055b4396a1451 in standard_planner (parse=0x55b43af05e28,
query_string=<optimized out>, cursorOptions=2048, boundParams=0x0) at
planner.c:413
#8 0x000055b4396a19c1 in planner (parse=parse@entry=0x55b43af05e28,
query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0)
at planner.c:277
#9 0x000055b439790c78 in pg_plan_query (querytree=querytree@entry=0x55b43af05e28,
query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0)
at postgres.c:883
#10 0x000055b439790d54 in pg_plan_queries (querytrees=0x55b43afdd528,
query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0)
at postgres.c:975
#11 0x000055b439791239 in exec_simple_query
(query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();")
at postgres.c:1169
#12 0x000055b439793183 in PostgresMain (dbname=<optimized out>,
username=<optimized out>) at postgres.c:4542
#13 0x000055b4396e6af7 in BackendRun (port=port@entry=0x55b43af2ffe0) at
postmaster.c:4489
#14 0x000055b4396e9c03 in BackendStartup (port=port@entry=0x55b43af2ffe0)
at postmaster.c:4217
#15 0x000055b4396e9e4a in ServerLoop () at postmaster.c:1791
#16 0x000055b4396eb401 in PostmasterMain (argc=7, argv=<optimized out>) at
postmaster.c:1463
#17 0x000055b43962b4df in main (argc=7, argv=0x55b43aeff0c0) at main.c:202
Actually, the original query failed like this:
#2 0x000055b4398e9f90 in ExceptionalCondition
(conditionName=conditionName@entry=0x55b439a61238 "plan->scanstatus ==
SUBQUERY_SCAN_UNKNOWN", errorType=errorType@entry=0x55b43994b00b
"FailedAssertion",
#3 0x000055b4396a2ecf in trivial_subqueryscan (plan=0x55b43b59cac8) at
setrefs.c:1367
Hi,
I logged the value of plan->scanstatus before the assertion :
2022-04-08 16:20:59.601 UTC [26325] LOG: scan status 0
2022-04-08 16:20:59.601 UTC [26325] STATEMENT: explain SELECT 1 FROM
information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
2022-04-08 16:20:59.796 UTC [26296] LOG: server process (PID 26325) was
terminated by signal 11: Segmentation fault
It seems its value was SUBQUERY_SCAN_UNKNOWN.
Still trying to find out the cause for the crash.
Hi,
On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.
| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
Program terminated with signal SIGSEGV, Segmentation fault.
Reproduced. Will look into this.
Thanks for the report!
Best regards,
Etsuro Fujita
On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.
| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
Program terminated with signal SIGSEGV, Segmentation fault.
Reproduced. Will look into this.
I think the cause of this is that mark_async_capable_plan() failed to
take into account that when the given path is a SubqueryScanPath or
ForeignPath, the given corresponding plan might include a gating
Result node that evaluates pseudoconstant quals. My oversight. :-(
Attached is a patch for fixing that. I think v14 has the same issue,
so I think we need backpatching.
Best regards,
Etsuro Fujita
Attachments:
prevent-async.patchapplication/octet-stream; name=prevent-async.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 30e95f585f..5f74595198 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10706,6 +10706,72 @@ SELECT * FROM result_tbl ORDER BY a;
(12 rows)
DELETE FROM result_tbl;
+-- Prevent async execution if we use gating Result nodes for pseudoconstant
+-- quals
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE CURRENT_USER = SESSION_USER;
+ QUERY PLAN
+----------------------------------------------------------------
+ Append
+ -> Result
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Result
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Result
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+(18 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+(SELECT * FROM async_p1 WHERE CURRENT_USER = SESSION_USER)
+UNION ALL
+(SELECT * FROM async_p2 WHERE CURRENT_USER = SESSION_USER);
+ QUERY PLAN
+----------------------------------------------------------------
+ Append
+ -> Result
+ Output: async_p1.a, async_p1.b, async_p1.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Result
+ Output: async_p2.a, async_p2.b, async_p2.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, async_p2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+(13 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM ((SELECT * FROM async_p1 WHERE b < 10) UNION ALL (SELECT * FROM async_p2 WHERE b < 10)) s WHERE CURRENT_USER = SESSION_USER;
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Append
+ -> Result
+ Output: async_p1.a, async_p1.b, async_p1.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((b < 10))
+ -> Result
+ Output: async_p2.a, async_p2.b, async_p2.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, async_p2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((b < 10))
+(13 rows)
+
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index ea35e61eb8..e1b93f39f1 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3398,6 +3398,19 @@ UNION ALL
SELECT * FROM result_tbl ORDER BY a;
DELETE FROM result_tbl;
+-- Prevent async execution if we use gating Result nodes for pseudoconstant
+-- quals
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE CURRENT_USER = SESSION_USER;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+(SELECT * FROM async_p1 WHERE CURRENT_USER = SESSION_USER)
+UNION ALL
+(SELECT * FROM async_p2 WHERE CURRENT_USER = SESSION_USER);
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM ((SELECT * FROM async_p1 WHERE b < 10) UNION ALL (SELECT * FROM async_p2 WHERE b < 10)) s WHERE CURRENT_USER = SESSION_USER;
+
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 95476ada0b..447d908c8a 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1125,6 +1125,13 @@ mark_async_capable_plan(Plan *plan, Path *path)
{
SubqueryScan *scan_plan = (SubqueryScan *) plan;
+ /*
+ * If the generated plan node includes a gating Result node,
+ * we can't execute it asynchronously.
+ */
+ if (IsA(plan, Result))
+ return false;
+
/*
* If a SubqueryScan node atop of an async-capable plan node
* is deletable, consider it as async-capable.
@@ -1139,6 +1146,13 @@ mark_async_capable_plan(Plan *plan, Path *path)
{
FdwRoutine *fdwroutine = path->parent->fdwroutine;
+ /*
+ * If the generated plan node includes a gating Result node,
+ * we can't execute it asynchronously.
+ */
+ if (IsA(plan, Result))
+ return false;
+
Assert(fdwroutine != NULL);
if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
Hi,
On Sat, Apr 9, 2022 at 1:24 AM Zhihong Yu <zyu@yugabyte.com> wrote:
On Fri, Apr 8, 2022 at 5:43 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.
| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
Program terminated with signal SIGSEGV, Segmentation fault.
I logged the value of plan->scanstatus before the assertion :
2022-04-08 16:20:59.601 UTC [26325] LOG: scan status 0
2022-04-08 16:20:59.601 UTC [26325] STATEMENT: explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
2022-04-08 16:20:59.796 UTC [26296] LOG: server process (PID 26325) was terminated by signal 11: Segmentation fault
It seems its value was SUBQUERY_SCAN_UNKNOWN.
Still trying to find out the cause for the crash.
I think the cause is an oversight in mark_async_capable_plan(). See [1]/messages/by-id/CAPmGK15NkuaVo0Fu_0TfoCpPPJaJi4OMLzEQtkE6Bt6YT52fPQ@mail.gmail.com.
Thanks!
Best regards,
Etsuro Fujita
[1]: /messages/by-id/CAPmGK15NkuaVo0Fu_0TfoCpPPJaJi4OMLzEQtkE6Bt6YT52fPQ@mail.gmail.com
On Sun, Apr 10, 2022 at 3:42 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:
On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:
On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby <pryzby@telsasoft.com>
wrote:
This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.
| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
Program terminated with signal SIGSEGV, Segmentation fault.
Reproduced. Will look into this.
I think the cause of this is that mark_async_capable_plan() failed to
take into account that when the given path is a SubqueryScanPath or
ForeignPath, the given corresponding plan might include a gating
Result node that evaluates pseudoconstant quals. My oversight. :-(
Attached is a patch for fixing that. I think v14 has the same issue,
so I think we need backpatching.
Best regards,
Etsuro Fujita
Hi,
Looking at the second hunk of the patch:
FdwRoutine *fdwroutine = path->parent->fdwroutine;
...
+ if (IsA(plan, Result))
+ return false;
It seems the check of whether plan is a Result node can be lifted ahead of
the switch statement (i.e. to the beginning of mark_async_capable_plan).
This way, we don't have to check for every case in the switch statement.
Cheers
On Sun, Apr 10, 2022 at 07:43:48PM +0900, Etsuro Fujita wrote:
On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.
| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
Program terminated with signal SIGSEGV, Segmentation fault.
Reproduced. Will look into this.
I think the cause of this is that mark_async_capable_plan() failed to
take into account that when the given path is a SubqueryScanPath or
ForeignPath, the given corresponding plan might include a gating
Result node that evaluates pseudoconstant quals. My oversight. :-(
Attached is a patch for fixing that. I think v14 has the same issue,
so I think we need backpatching.
Thanks - this seems to resolve the issue.
On Sun, Apr 10, 2022 at 06:46:25AM -0700, Zhihong Yu wrote:
Looking at the second hunk of the patch:
FdwRoutine *fdwroutine = path->parent->fdwroutine;
...
+ if (IsA(plan, Result))
+ return false;
It seems the check of whether plan is a Result node can be lifted ahead of
the switch statement (i.e. to the beginning of mark_async_capable_plan).
This way, we don't have to check for every case in the switch statement.
I think you misread it - the other branch says: if (*not* IsA())
--
Justin
On Sun, Apr 10, 2022 at 7:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Sun, Apr 10, 2022 at 07:43:48PM +0900, Etsuro Fujita wrote:
On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:
On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby <pryzby@telsasoft.com>
wrote:
This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.
| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
Program terminated with signal SIGSEGV, Segmentation fault.
Reproduced. Will look into this.
I think the cause of this is that mark_async_capable_plan() failed to
take into account that when the given path is a SubqueryScanPath or
ForeignPath, the given corresponding plan might include a gating
Result node that evaluates pseudoconstant quals. My oversight. :-(
Attached is a patch for fixing that. I think v14 has the same issue,
so I think we need backpatching.
Thanks - this seems to resolve the issue.
On Sun, Apr 10, 2022 at 06:46:25AM -0700, Zhihong Yu wrote:
Looking at the second hunk of the patch:
FdwRoutine *fdwroutine = path->parent->fdwroutine;
...
+ if (IsA(plan, Result))
+ return false;
It seems the check of whether plan is a Result node can be lifted ahead of
the switch statement (i.e. to the beginning of mark_async_capable_plan).
This way, we don't have to check for every case in the switch statement.
I think you misread it - the other branch says: if (*not* IsA())
No, I didn't misread:
if (!IsA(plan, Result) &&
mark_async_capable_plan(plan,
((ProjectionPath *) path)->subpath))
return true;
return false;
If the plan is a Result node, false would be returned.
So the check can be lifted to the beginning of the func.
Cheers
On Mon, Apr 11, 2022 at 11:44 AM Zhihong Yu <zyu@yugabyte.com> wrote:
On Sun, Apr 10, 2022 at 7:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Sun, Apr 10, 2022 at 06:46:25AM -0700, Zhihong Yu wrote:
Looking at the second hunk of the patch:
FdwRoutine *fdwroutine = path->parent->fdwroutine;
...
+ if (IsA(plan, Result))
+ return false;
It seems the check of whether plan is a Result node can be lifted ahead of
the switch statement (i.e. to the beginning of mark_async_capable_plan).
This way, we don't have to check for every case in the switch statement.
I think you misread it - the other branch says: if (*not* IsA())
No, I didn't misread:
if (!IsA(plan, Result) &&
mark_async_capable_plan(plan,
((ProjectionPath *) path)->subpath))
return true;
return false;
If the plan is a Result node, false would be returned.
So the check can be lifted to the beginning of the func.
I think we might support more cases in the switch statement in the
future. My concern about your proposal is that it might make it hard
to add new cases to the statement. I agree that what I proposed has a
bit of redundant code, but writing code inside each case independently
would make it easy to add them, making code consistent across branches
and thus making back-patching easy.
Thanks for reviewing!
Best regards,
Etsuro Fujita
On Sun, Apr 17, 2022 at 1:48 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:
On Mon, Apr 11, 2022 at 11:44 AM Zhihong Yu <zyu@yugabyte.com> wrote:
On Sun, Apr 10, 2022 at 7:41 PM Justin Pryzby <pryzby@telsasoft.com>
wrote:
On Sun, Apr 10, 2022 at 06:46:25AM -0700, Zhihong Yu wrote:
Looking at the second hunk of the patch:
FdwRoutine *fdwroutine = path->parent->fdwroutine;
...
+ if (IsA(plan, Result))
+ return false;
It seems the check of whether plan is a Result node can be lifted ahead of
the switch statement (i.e. to the beginning of mark_async_capable_plan).
This way, we don't have to check for every case in the switch statement.
I think you misread it - the other branch says: if (*not* IsA())
No, I didn't misread:
if (!IsA(plan, Result) &&
mark_async_capable_plan(plan,
((ProjectionPath *)path)->subpath))
return true;
return false;
If the plan is a Result node, false would be returned.
So the check can be lifted to the beginning of the func.
I think we might support more cases in the switch statement in the
future. My concern about your proposal is that it might make it hard
to add new cases to the statement. I agree that what I proposed has a
bit of redundant code, but writing code inside each case independently
would make it easy to add them, making code consistent across branches
and thus making back-patching easy.
Thanks for reviewing!
Best regards,
Etsuro Fujita
Hi,
When a new case arises where the plan is not a Result node, this func can
be rewritten.
If there is only one such new case, the check at the beginning of the func
can be tuned to exclude that case.
I still think the check should be lifted to the beginning of the func
(given the current cases).
Cheers
Hi,
On Sun, Apr 17, 2022 at 7:30 PM Zhihong Yu <zyu@yugabyte.com> wrote:
On Sun, Apr 17, 2022 at 1:48 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I think we might support more cases in the switch statement in the
future. My concern about your proposal is that it might make it hard
to add new cases to the statement. I agree that what I proposed has a
bit of redundant code, but writing code inside each case independently
would make it easy to add them, making code consistent across branches
and thus making back-patching easy.
When a new case arises where the plan is not a Result node, this func can be rewritten.
If there is only one such new case, the check at the beginning of the func can be tuned to exclude that case.
Sorry, I don't agree with you.
I still think the check should be lifted to the beginning of the func (given the current cases).
The given path isn't limited to SubqueryScanPath, ForeignPath and
ProjectionPath, so another concern is extra cycles needed when the
path is some other path type that is projection-capable (e.g., Path for
sequential scan, IndexPath, NestPath, ...). Assume that the given
path is a Path (that doesn't contain pseudoconstant quals). In that
case the given SeqScan plan node wouldn't contain a gating Result
node, so if we put the if test at the top of the function, we need to
execute not only the test but the switch statement for the given
path/plan nodes. But if we put the if test inside each case block, we
only need to execute the switch statement, without executing the test.
In the latter case I think we can save cycles for normal cases.
In short: I don't think it's a great idea to put the if test at the
top of the function.
Best regards,
Etsuro Fujita
On Tue, Apr 19, 2022 at 2:01 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:
Hi,
On Sun, Apr 17, 2022 at 7:30 PM Zhihong Yu <zyu@yugabyte.com> wrote:
On Sun, Apr 17, 2022 at 1:48 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:
I think we might support more cases in the switch statement in the
future. My concern about your proposal is that it might make it hard
to add new cases to the statement. I agree that what I proposed has a
bit of redundant code, but writing code inside each case independently
would make it easy to add them, making code consistent across branches
and thus making back-patching easy.
When a new case arises where the plan is not a Result node, this func
can be rewritten.
If there is only one such new case, the check at the beginning of the
func can be tuned to exclude that case.
Sorry, I don't agree with you.
I still think the check should be lifted to the beginning of the func
(given the current cases).
The given path isn't limited to SubqueryScanPath, ForeignPath and
ProjectionPath, so another concern is extra cycles needed when the
path is some other path type that is projection-capable (e.g., Path for
sequential scan, IndexPath, NestPath, ...). Assume that the given
path is a Path (that doesn't contain pseudoconstant quals). In that
case the given SeqScan plan node wouldn't contain a gating Result
node, so if we put the if test at the top of the function, we need to
execute not only the test but the switch statement for the given
path/plan nodes. But if we put the if test inside each case block, we
only need to execute the switch statement, without executing the test.
In the latter case I think we can save cycles for normal cases.
In short: I don't think it's a great idea to put the if test at the
top of the function.
Best regards,
Etsuro Fujita
Hi,
It is okay to keep the formulation in your patch.
Cheers
Hi,
On Wed, Apr 20, 2022 at 2:04 AM Zhihong Yu <zyu@yugabyte.com> wrote:
It is okay to keep the formulation in your patch.
I modified mark_async_capable_plan() a bit further; 1) adjusted code
in the ProjectionPath case, just for consistency with other cases, and
2) tweaked/improved comments a bit. Attached is a new version of the
patch (“prevent-async-2.patch”).
As mentioned before, v14 has the same issue, so I created a fix for
v14, which I’m attaching as well (“prevent-async-2-v14.patch”). In
the fix I modified is_async_capable_path() the same way as
mark_async_capable_plan() in HEAD, renaming it to
is_async_capable_plan(), and updated some comments.
Barring objections, I’ll push/back-patch these.
Thanks!
Best regards,
Etsuro Fujita
Attachments:
prevent-async-2.patchapplication/octet-stream; name=prevent-async-2.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 477de09a87..3ebe7df89f 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10734,6 +10734,72 @@ SELECT * FROM result_tbl ORDER BY a;
(12 rows)
DELETE FROM result_tbl;
+-- Prevent async execution if we use gating Result nodes for pseudoconstant
+-- quals
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE CURRENT_USER = SESSION_USER;
+ QUERY PLAN
+----------------------------------------------------------------
+ Append
+ -> Result
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Result
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Result
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+(18 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+(SELECT * FROM async_p1 WHERE CURRENT_USER = SESSION_USER)
+UNION ALL
+(SELECT * FROM async_p2 WHERE CURRENT_USER = SESSION_USER);
+ QUERY PLAN
+----------------------------------------------------------------
+ Append
+ -> Result
+ Output: async_p1.a, async_p1.b, async_p1.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Result
+ Output: async_p2.a, async_p2.b, async_p2.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, async_p2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+(13 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM ((SELECT * FROM async_p1 WHERE b < 10) UNION ALL (SELECT * FROM async_p2 WHERE b < 10)) s WHERE CURRENT_USER = SESSION_USER;
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Append
+ -> Result
+ Output: async_p1.a, async_p1.b, async_p1.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p1
+ Output: async_p1.a, async_p1.b, async_p1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((b < 10))
+ -> Result
+ Output: async_p2.a, async_p2.b, async_p2.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p2
+ Output: async_p2.a, async_p2.b, async_p2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((b < 10))
+(13 rows)
+
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index ed181dedff..310ac72788 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3410,6 +3410,19 @@ UNION ALL
SELECT * FROM result_tbl ORDER BY a;
DELETE FROM result_tbl;
+-- Prevent async execution if we use gating Result nodes for pseudoconstant
+-- quals
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE CURRENT_USER = SESSION_USER;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+(SELECT * FROM async_p1 WHERE CURRENT_USER = SESSION_USER)
+UNION ALL
+(SELECT * FROM async_p2 WHERE CURRENT_USER = SESSION_USER);
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM ((SELECT * FROM async_p1 WHERE b < 10) UNION ALL (SELECT * FROM async_p2 WHERE b < 10)) s WHERE CURRENT_USER = SESSION_USER;
+
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7905bc4654..db11936efe 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1112,9 +1112,9 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
/*
* mark_async_capable_plan
- * Check whether a given Path node is async-capable, and if so, mark the
- * Plan node created from it as such and return true, otherwise return
- * false.
+ * Check whether the Plan node created from a Path node is async-capable,
+ * and if so, mark the Plan node as such and return true, otherwise
+ * return false.
*/
static bool
mark_async_capable_plan(Plan *plan, Path *path)
@@ -1125,6 +1125,13 @@ mark_async_capable_plan(Plan *plan, Path *path)
{
SubqueryScan *scan_plan = (SubqueryScan *) plan;
+ /*
+ * If the generated plan node includes a gating Result node,
+ * we can't execute it asynchronously.
+ */
+ if (IsA(plan, Result))
+ return false;
+
/*
* If a SubqueryScan node atop of an async-capable plan node
* is deletable, consider it as async-capable.
@@ -1139,6 +1146,13 @@ mark_async_capable_plan(Plan *plan, Path *path)
{
FdwRoutine *fdwroutine = path->parent->fdwroutine;
+ /*
+ * If the generated plan node includes a gating Result node,
+ * we can't execute it asynchronously.
+ */
+ if (IsA(plan, Result))
+ return false;
+
Assert(fdwroutine != NULL);
if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
@@ -1148,11 +1162,17 @@ mark_async_capable_plan(Plan *plan, Path *path)
case T_ProjectionPath:
/*
- * If the generated plan node doesn't include a Result node,
- * consider it as async-capable if the subpath is async-capable.
+ * If the generated plan node includes a Result node for
+ * the projection, we can't execute it asynchronously.
+ */
+ if (IsA(plan, Result))
+ return false;
+
+ /*
+ * create_projection_plan() would have pulled up the subplan, so
+ * check the capability using the subpath.
*/
- if (!IsA(plan, Result) &&
- mark_async_capable_plan(plan,
+ if (mark_async_capable_plan(plan,
((ProjectionPath *) path)->subpath))
return true;
return false;
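The pattern added to each case above is uniform: before consulting any path-specific logic, bail out if create_plan() wrapped the node in a gating Result for a pseudoconstant qual. A standalone C sketch of that shape, using mock node tags and structs rather than the real PostgreSQL definitions (illustration only, not actual PG code):

```c
#include <assert.h>
#include <stdbool.h>

/* Mock node tags standing in for PostgreSQL's NodeTag (illustration only) */
typedef enum { T_Result, T_ForeignScan, T_SubqueryScan } NodeTag;

typedef struct Plan
{
    NodeTag tag;            /* what create_plan() actually built */
    bool    async_capable;
} Plan;

typedef struct Path
{
    NodeTag pathtag;        /* what the planner asked for */
    bool    fdw_async_ok;   /* stand-in for IsForeignPathAsyncCapable() */
} Path;

/*
 * Mirror of the fixed logic: a gating Result wrapped around the scan
 * makes the subplan non-async-capable, regardless of what the
 * underlying path supports.
 */
static bool
mark_async_capable_plan(Plan *plan, Path *path)
{
    /* A gating Result node means we can't execute asynchronously. */
    if (plan->tag == T_Result)
        return false;

    if (path->pathtag == T_ForeignScan && path->fdw_async_ok)
    {
        plan->async_capable = true;
        return true;
    }
    return false;
}
```

With these mocks, a foreign path whose generated plan is a bare ForeignScan gets marked async-capable, while the same path whose plan came back wrapped in a gating Result does not, which is exactly the behavior the EXPLAIN output above demonstrates.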
Attachment: prevent-async-2-v14.patch
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 8bcc18eca6..e22cc871f3 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10424,6 +10424,32 @@ DROP TABLE local_tbl;
DROP INDEX base_tbl1_idx;
DROP INDEX base_tbl2_idx;
DROP INDEX async_p3_idx;
+-- Prevent async execution if we use gating Result nodes for pseudoconstant
+-- quals
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE CURRENT_USER = SESSION_USER;
+ QUERY PLAN
+----------------------------------------------------------------
+ Append
+ -> Result
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p1 async_pt_1
+ Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Result
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Foreign Scan on public.async_p2 async_pt_2
+ Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Result
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Seq Scan on public.async_p3 async_pt_3
+ Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+(18 rows)
+
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 855c7ea70e..fede8de1fb 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3317,6 +3317,11 @@ DROP INDEX base_tbl1_idx;
DROP INDEX base_tbl2_idx;
DROP INDEX async_p3_idx;
+-- Prevent async execution if we use gating Result nodes for pseudoconstant
+-- quals
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE CURRENT_USER = SESSION_USER;
+
-- Test that pending requests are processed properly
SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index a2101fb3fc..0ed858f305 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -82,7 +82,7 @@ static List *get_gating_quals(PlannerInfo *root, List *quals);
static Plan *create_gating_plan(PlannerInfo *root, Path *path, Plan *plan,
List *gating_quals);
static Plan *create_join_plan(PlannerInfo *root, JoinPath *best_path);
-static bool is_async_capable_path(Path *path);
+static bool is_async_capable_plan(Plan *plan, Path *path);
static Plan *create_append_plan(PlannerInfo *root, AppendPath *best_path,
int flags);
static Plan *create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
@@ -1109,11 +1109,11 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
}
/*
- * is_async_capable_path
- * Check whether a given Path node is async-capable.
+ * is_async_capable_plan
+ * Check whether the Plan node created from a Path node is async-capable.
*/
static bool
-is_async_capable_path(Path *path)
+is_async_capable_plan(Plan *plan, Path *path)
{
switch (nodeTag(path))
{
@@ -1121,6 +1121,13 @@ is_async_capable_path(Path *path)
{
FdwRoutine *fdwroutine = path->parent->fdwroutine;
+ /*
+ * If the generated plan node includes a gating Result node,
+ * we can't execute it asynchronously.
+ */
+ if (IsA(plan, Result))
+ return false;
+
Assert(fdwroutine != NULL);
if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
@@ -1295,8 +1302,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
subplans = lappend(subplans, subplan);
- /* Check to see if subplan can be executed asynchronously */
- if (consider_async && is_async_capable_path(subpath))
+ /* If needed, check to see if subplan can be executed asynchronously */
+ if (consider_async && is_async_capable_plan(subplan, subpath))
{
subplan->async_capable = true;
++nasyncplans;
On Mon, Apr 25, 2022 at 1:29 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I modified mark_async_capable_plan() a bit further; 1) adjusted code
in the ProjectionPath case, just for consistency with other cases, and
2) tweaked/improved comments a bit. Attached is a new version of the
patch (“prevent-async-2.patch”).

As mentioned before, v14 has the same issue, so I created a fix for
v14, which I’m attaching as well (“prevent-async-2-v14.patch”). In
the fix I modified is_async_capable_path() the same way as
mark_async_capable_plan() in HEAD, renaming it to
is_async_capable_plan(), and updated some comments.

Barring objections, I’ll push/back-patch these.
Done.
Best regards,
Etsuro Fujita
On Wed, Apr 6, 2022 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I have committed the patch after modifying it as such.
The patch calls trivial_subqueryscan() during create_append_plan() to
determine the triviality of a SubqueryScan that is a child of an
Append node. Unlike when calling it from
set_subqueryscan_references(), this is done before some
post-processing such as set_plan_references() on the subquery. The
reason why this is safe wouldn't be that obvious, so I added to
trivial_subqueryscan() comments explaining this. Attached is a patch
for that.
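The triviality test itself (no quals to check, and a targetlist that just regurgitates the child's output) is cheap but can run repeatedly on the same node, hence the cached status the comments refer to. A simplified C sketch of that check-and-cache shape, with mock fields standing in for the real SubqueryScan node (illustration only):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for SubqueryScanStatus in the real tree (illustration only) */
typedef enum
{
    SUBQUERY_SCAN_UNKNOWN,
    SUBQUERY_SCAN_TRIVIAL,
    SUBQUERY_SCAN_NONTRIVIAL
} ScanStatus;

typedef struct MockSubqueryScan
{
    ScanStatus scanstatus;          /* cached result of the triviality test */
    int        nquals;              /* number of quals at the scan */
    bool       tlist_matches_child; /* tlist just regurgitates child output? */
} MockSubqueryScan;

/* Mirror of trivial_subqueryscan(): compute once, then reuse the cache. */
static bool
trivial_subqueryscan(MockSubqueryScan *plan)
{
    /* We might have detected this already; if so reuse the result. */
    if (plan->scanstatus == SUBQUERY_SCAN_TRIVIAL)
        return true;
    if (plan->scanstatus == SUBQUERY_SCAN_NONTRIVIAL)
        return false;

    if (plan->nquals == 0 && plan->tlist_matches_child)
    {
        plan->scanstatus = SUBQUERY_SCAN_TRIVIAL;
        return true;
    }
    plan->scanstatus = SUBQUERY_SCAN_NONTRIVIAL;
    return false;
}
```

The caching is what makes the early call from mark_async_capable_plan() harmless: when set_subqueryscan_references() asks the same question later, it gets the memoized answer, and the patch's comments explain why the later tlist rewriting cannot change it.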
Best regards,
Etsuro Fujita
Attachments:
Improve-comments-for-trivial_subqueryscan.patch
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index d95fd89807..5108dbaf81 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1349,8 +1349,22 @@ set_subqueryscan_references(PlannerInfo *root,
* We can delete it if it has no qual to check and the targetlist just
* regurgitates the output of the child plan.
*
- * This might be called repeatedly on a SubqueryScan node, so we cache the
- * result in the SubqueryScan node to avoid repeated computation.
+ * This can be called from mark_async_capable_plan(), a helper function for
+ * create_append_plan(), before set_subqueryscan_references(), to determine
+ * triviality of a SubqueryScan that is a child of an Append node. So we
+ * cache the result in the SubqueryScan node to avoid repeated computation.
+ *
+ * Note: when called from mark_async_capable_plan(), we determine the result
+ * before running finalize_plan() on the SubqueryScan node (if needed) and
+ * set_plan_references() on the subplan tree, but this would be safe because
+ * 1) finalize_plan() doesn't modify the tlist or quals for the SubqueryScan
+ * node (or that for any plan node in the subplan tree), 2)
+ * set_plan_references() modifies the tlist for every plan node in the
+ * subplan tree, but keeps const/resjunk columns as const/resjunk ones and
+ * preserves the length and order of the tlist, and 3) set_plan_references()
+ * might delete the topmost plan node like an Append or MergeAppend from the
+ * subplan tree and pull up the child plan node, but in that case, the tlist
+ * for the child plan node exactly matches the parent.
*/
bool
trivial_subqueryscan(SubqueryScan *plan)
@@ -1359,7 +1373,7 @@ trivial_subqueryscan(SubqueryScan *plan)
ListCell *lp,
*lc;
- /* We might have detected this already (see mark_async_capable_plan) */
+ /* We might have detected this already; in which case reuse the result */
if (plan->scanstatus == SUBQUERY_SCAN_TRIVIAL)
return true;
if (plan->scanstatus == SUBQUERY_SCAN_NONTRIVIAL)
On Thu, Jun 2, 2022 at 5:08 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Wed, Apr 6, 2022 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I have committed the patch after modifying it as such.
The patch calls trivial_subqueryscan() during create_append_plan() to
determine the triviality of a SubqueryScan that is a child of an
Append node. Unlike when calling it from
set_subqueryscan_references(), this is done before some
post-processing such as set_plan_references() on the subquery. The
reason why this is safe wouldn't be that obvious, so I added to
trivial_subqueryscan() comments explaining this. Attached is a patch
for that.

Best regards,
Etsuro Fujita
Hi,
Suggestion on formatting the comment:
+ * node (or that for any plan node in the subplan tree), 2)
+ * set_plan_references() modifies the tlist for every plan node in the
It would be more readable if `2)` is put at the beginning of the second
line above.
+ * preserves the length and order of the tlist, and 3) set_plan_references()
+ * might delete the topmost plan node like an Append or MergeAppend from the
Similarly you can move `3) set_plan_references()` to the beginning of the
next line.
Cheers
On Fri, Jun 3, 2022 at 1:03 AM Zhihong Yu <zyu@yugabyte.com> wrote:
Suggestion on formatting the comment:
+ * node (or that for any plan node in the subplan tree), 2)
+ * set_plan_references() modifies the tlist for every plan node in the
It would be more readable if `2)` is put at the beginning of the second line above.
+ * preserves the length and order of the tlist, and 3) set_plan_references()
+ * might delete the topmost plan node like an Append or MergeAppend from the
Similarly you can move `3) set_plan_references()` to the beginning of the next line.
Seems like a good idea, so I updated the patch as you suggest. I did
some indentation as well, which I think improves readability a bit
further. Attached is an updated version. If no objections, I’ll
commit the patch.
Thanks for reviewing!
Best regards,
Etsuro Fujita
Attachments:
Improve-comments-for-trivial_subqueryscan-2.patch
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index d95fd89807..9cef92cab2 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1349,8 +1349,23 @@ set_subqueryscan_references(PlannerInfo *root,
* We can delete it if it has no qual to check and the targetlist just
* regurgitates the output of the child plan.
*
- * This might be called repeatedly on a SubqueryScan node, so we cache the
- * result in the SubqueryScan node to avoid repeated computation.
+ * This can be called from mark_async_capable_plan(), a helper function for
+ * create_append_plan(), before set_subqueryscan_references(), to determine
+ * triviality of a SubqueryScan that is a child of an Append node. So we
+ * cache the result in the SubqueryScan node to avoid repeated computation.
+ *
+ * Note: when called from mark_async_capable_plan(), we determine the result
+ * before running finalize_plan() on the SubqueryScan node (if needed) and
+ * set_plan_references() on the subplan tree, but this would be safe, because
+ * 1) finalize_plan() doesn't modify the tlist or quals for the SubqueryScan
+ * node (or that for any plan node in the subplan tree), and
+ * 2) set_plan_references() modifies the tlist for every plan node in the
+ * subplan tree, but keeps const/resjunk columns as const/resjunk ones and
+ * preserves the length and order of the tlist, and
+ * 3) set_plan_references() might delete the topmost plan node like an Append
+ * or MergeAppend from the subplan tree and pull up the child plan node,
+ * but in that case, the tlist for the child plan node exactly matches the
+ * parent.
*/
bool
trivial_subqueryscan(SubqueryScan *plan)
@@ -1359,7 +1374,7 @@ trivial_subqueryscan(SubqueryScan *plan)
ListCell *lp,
*lc;
- /* We might have detected this already (see mark_async_capable_plan) */
+ /* We might have detected this already; in which case reuse the result */
if (plan->scanstatus == SUBQUERY_SCAN_TRIVIAL)
return true;
if (plan->scanstatus == SUBQUERY_SCAN_NONTRIVIAL)
On Wed, Jun 8, 2022 at 7:18 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, Jun 3, 2022 at 1:03 AM Zhihong Yu <zyu@yugabyte.com> wrote:
Suggestion on formatting the comment:
+ * node (or that for any plan node in the subplan tree), 2)
+ * set_plan_references() modifies the tlist for every plan node in the
It would be more readable if `2)` is put at the beginning of the second line above.
+ * preserves the length and order of the tlist, and 3) set_plan_references()
+ * might delete the topmost plan node like an Append or MergeAppend from the
Similarly you can move `3) set_plan_references()` to the beginning of the next line.
Seems like a good idea, so I updated the patch as you suggest. I did
some indentation as well, which I think improves readability a bit
further. Attached is an updated version. If no objections, I’ll
commit the patch.
Done.
Best regards,
Etsuro Fujita